# adversarial_support_alignment__f9696308.pdf Published as a conference paper at ICLR 2022 ADVERSARIAL SUPPORT ALIGNMENT Shangyuan Tong MIT CSAIL Timur Garipov MIT CSAIL Yang Zhang MIT-IBM Watson AI Lab Shiyu Chang UC Santa Barbara Tommi Jaakkola MIT CSAIL We study the problem of aligning the supports of distributions. Compared to the existing work on distribution alignment, support alignment does not require the densities to be matched. We propose symmetric support difference as a divergence measure to quantify the mismatch between supports. We show that select discriminators (e.g. discriminator trained for Jensen Shannon divergence) are able to map support differences as support differences in their one-dimensional output space. Following this result, our method aligns supports by minimizing a symmetrized relaxed optimal transport cost in the discriminator 1D space via an adversarial process. Furthermore, we show that our approach can be viewed as a limit of existing notions of alignment by increasing transportation assignment tolerance. We quantitatively evaluate the method across domain adaptation tasks with shifts in label distributions. Our experiments1 show that the proposed method is more robust against these shifts than other alignment-based baselines. 1 INTRODUCTION Learning tasks often involve estimating properties of distributions from samples or aligning such characteristics across domains. We can align full distributions (adversarial domain alignment), certain statistics (canonical correlation analysis), or the support of distributions (this paper). Much of the recent work has focused on full distributional alignment, for good reasons. In domain adaptation, motivated by theoretical results (Ben-David et al., 2007; 2010), a series of papers (Ajakan et al., 2014; Ganin & Lempitsky, 2015; Ganin et al., 2016; Tzeng et al., 2017; Shen et al., 2018; Pei et al., 2018; Zhao et al., 2018; Li et al., 2018a; Wang et al., 2021; Kumar et al., 2018) seek to align distributions of representations between domains, and utilize a shared classifier on the aligned representation space. Alignment in distributions implies alignment in supports. However, when there are additional objectives/constraints to satisfy, the minimizer for a distribution alignment objective does not necessarily minimize a support alignment objective. Example in Figure 1 demonstrates the qualitative distinction between two minimizers when distribution alignment is not achievable. The distribution alignment objective prefers to keep supports unaligned even if support alignment is achievable. Recent works (Zhao et al., 2019; Li et al., 2020; Tan et al., 2020; Wu et al., 2019b; Tachet des Combes et al., 2020) have demonstrated that a shift in label distributions between source and target leads to a characterizable performance drop when the representations are forced into a distribution alignment. The error bound in Johansson et al. (2019) suggests aligning the supports of representations instead. In this paper, we focus on distribution support as the key characteristic to align. We introduce a support divergence to measure the support mismatch and algorithms to optimize such alignment. We also position our approach in the spectrum of other alignment methods. Our contributions are as follows (all proofs can be found in Appendix A): 1. In Section 2.1, we measure the differences between supports of distributions. 
Building on the Hausdorff distance, we introduce a novel support divergence better suited for optimization, which we refer to as symmetric support difference (SSD) divergence. First two authors contributed equally. Correspondence to Shangyuan Tong (sytong@csail.mit.edu). 1We provide the code reproducing experiment results at https://github.com/timgaripov/asa. Published as a conference paper at ICLR 2022 2. In Section 2.2, we identify an important property of the discriminator trained for Jensen Shannon divergence: support differences in the original space of interest are preserved as support differences in the one-dimensional discriminator output space. 3. In Section 3, we present our practical algorithm for support alignment, Adversarial Support Alignment (ASA). Essentially, based on the analysis presented in Section 2.2, our solution is to align supports in the discriminator 1D space, which is computationally efficient. 4. In Section 4, we place different notions of alignment distribution alignment, relaxed distribution alignment and support alignment within a coherent spectrum from the point of view of optimal transport, characterizing their relationships, both theoretically in terms of their objectives and practically in terms of their algorithms. 5. In Section 5, we demonstrate the effectiveness of support alignment in practice for domain adaptation setting. Compared to other alignment-based baselines, our proposed method is more robust against shifts in label distributions. 3 2 1 0 1 x (a) Initialization DW (p, qθ) = 11.12 D (p, qθ) = 14.9 0.0 0.2 0.4 0.6 0.8 1.0 1.2 (b) Distribution alignment DW (p, qθ) = 2 10 3 D (p, qθ) = 6 10 4 0.0 0.2 0.4 0.6 0.8 1.0 x (c) Support alignment DW (p, qθ) = 5 10 2 D (p, qθ) < 1 10 6 Figure 1: Illustration of differences between the final configurations of distribution alignment and support alignment procedures. p(x) is a fixed Beta distribution p(x) = Beta(x | 4, 2) with support [0, 1]; qθ(x) is a shifted Beta distribution qθ(x) = Beta(x θ | 2, 4) parameterized by θ with support [θ, θ + 1]. Panel (a) shows the initial configuration with θinit = 3. Panel (b) shows the result by distribution alignment. Panel (c) shows the result by support alignment. We report Wasserstein distance DW (p, qθ) (7) and SSD divergence D (p, qθ) (1). 2 SSD DIVERGENCE AND SUPPORT ALIGNMENT Notation. We consider an Euclidean space X = Rn equipped with Borel sigma algebra B and a metric d : X X R (e.g. Euclidean distance). Let P be the set of probability measures on (X, B). For p P, the support of p is denoted by supp(p) and is defined as the smallest closed set X X such that p(X) = 1. f p denotes the pushforward measure of p induced by a measurable mapping f. With a slight abuse of notation, we use p(x) and [f p](t) to denote the densities of measures p and f p evaluated at x and t respectively, implicitly assuming that the measures are absolutely continuous. The distance between a point x X and a subset Y X is defined as d(x, Y ) = infy Y d(x, y). The symmetric difference of two sets A and B is defined as A B = (A \ B) (B \ A). 2.1 DIFFERENCE BETWEEN SUPPORTS To align the supports of distributions, we first need to evaluate how different they are. Similar to distribution divergences like Jensen Shannon divergence, we introduce a notion of support divergence. A support divergence2 between two distributions in P is a function DS( , ) : P P R satisfying: 1) DS(p, q) 0 for all p, q P; 2) DS(p, q) = 0 iff supp(p) = supp(q). 
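For intuition, the contrast illustrated in Figure 1 can be reproduced directly from samples. The sketch below (assuming NumPy and SciPy; the nearest-neighbor support-mismatch estimator is a finite-sample stand-in for the divergence introduced in Section 2.1, and all helper names are ours) shows that at θ = 0 the two supports coincide and the estimated support mismatch collapses toward zero while the Wasserstein distance does not, whereas at the initialization θ = −3 both quantities are large; the exact values need not match those reported in Figure 1.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n = 20000

def nn_support_mismatch(xs, ys):
    # E_x[d(x, supp(q))] + E_y[d(y, supp(p))], with each support approximated
    # by the observed samples (1-nearest-neighbor distances in 1D)
    def one_sided(a, b):
        b = np.sort(b)
        idx = np.clip(np.searchsorted(b, a), 1, len(b) - 1)
        return np.minimum(np.abs(a - b[idx - 1]), np.abs(a - b[idx])).mean()
    return one_sided(xs, ys) + one_sided(ys, xs)

xp = rng.beta(4, 2, size=n)              # p(x) = Beta(x | 4, 2), support [0, 1]
for theta in (-3.0, 0.0):                # initialization vs. aligned supports
    xq = rng.beta(2, 4, size=n) + theta  # q_theta, support [theta, theta + 1]
    print(f"theta = {theta:+.1f}: "
          f"W1 = {wasserstein_distance(xp, xq):.3f}, "
          f"support mismatch = {nn_support_mismatch(xp, xq):.4f}")
```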
While a distribution divergence is sensitive to both density and support differences, a support divergence only needs to detect mismatches in supports, which are subsets of the metric space X. 2It is not technically a divergence on the space of distributions, since DS(p, q) = 0 does not imply p = q. Published as a conference paper at ICLR 2022 An example of a distance between subsets of a metric space is the Hausdorff distance: d H(X, Y ) = max{supx X d(x, Y ), supy Y d(y, X)}. Since it depends only on the greatest distance between a point and a set, minimizing this objective for alignment only provides signal to a single point. To make the optimization less sparse, we consider all points that violate the support alignment criterion and introduce symmetric support difference (SSD) divergence: D (p, q) = Ex p [d(x, supp(q))] + Ex q [d(x, supp(p))] . (1) Proposition 2.1. SSD divergence D (p, q) is a support divergence. We note that our proposed SSD divergence is closely related to Chamfer distance/divergence (CD) (Fan et al., 2017; Nguyen et al., 2021) and Relaxed Word Mover s Distance (RWMD) (Kusner et al., 2015). While both CD and RWMD are stated for discrete points (see Section 6 for further comments), SSD divergence is a general difference measure between arbitrary (discrete or continuous) distributions. This distinction, albeit small, is important in our theoretical analysis (Sections 2.2, 4.1). 2.2 SUPPORT ALIGNMENT IN ONE-DIMENSIONAL SPACE Goodfellow et al. (2014) showed that the log-loss discriminator f : X [0, 1], trained to distinguish samples from distributions p and q (supf Ex p [log f(x)] + Ex q [log(1 f(x))]) can be used to estimate the Jensen Shannon divergence between p and q. The closed form maximizer f is f (x) = p(x) p(x) + q(x), x supp(p) supp(q). (2) Note that for a point x / supp(p) supp(q) the value of f (x) can be set to an arbitrary value in [0, 1], since the log-loss does not depend on f(x) for such x. The form of the optimal discriminator (2) gives rise to our main theorem below, which characterizes the ability of the log-loss discriminator to identify support misalignment. Theorem 2.1. Let p and q be the distributions with densities satisfying 1 C < p(x) < C, x supp(p); 1 C < q(x) < C, x supp(q). (3) Let f be the optimal discriminator (2). Then, D (p, q) = 0 if and only if D (f p, f q) = 0. The idea of the proof is to show that the extreme values (0 and 1) of f (x) can only be attained in x supp(p) supp(q). Assumption (3) guarantees that f (x) cannot approach neither 0 nor 1 in the intersection of the supports supp(p) supp(q), i.e. the values {f (x) | x supp(p) supp(q)} are separated from the extreme values 0 and 1. We conclude this section with two technical remarks on Theorem 2.1. Remark 2.1.1. The result of Theorem 2.1 does not necessarily hold for other types of discriminators. For instance, the dual Wasserstein discriminator (Arjovsky et al., 2017; Gulrajani et al., 2017) does not always highlight the support difference in the original space as a support difference in the discriminator output space. This observation is formaly stated in the following proposition. Proposition 2.2. Let f W be the maximizer of supf:L(f) 1 Ex p[f(x)] Ex q[f(x)], where L( ) is the Lipschitz constant. There exist p and q with supp(p) = supp(q) but supp(f W p) = supp(f W q). Remark 2.1.2. In practice the discriminator is typically parameterized as f(x) = σ(g(x)), where g : X R is realized by a deep neural network and σ(x) = (1 + e x) 1 is the sigmoid function. 
The optimization problem for g is inf g Ex p h log(1 + e g(x)) i + Ex q h log(1 + eg(x)) i , (4) and the optimal solution is g (x) = log p(x) log q(x). Naturally the result of Theorem 2.1 holds for g , since g (x) = σ 1(f (x)) and σ is a bijective mapping from R { , } to [0, 1]. Published as a conference paper at ICLR 2022 3 ADVERSARIAL SUPPORT ALIGNMENT We consider distributions p and q parameterized by θ: pθ, qθ. The log-loss discriminator g optimized for (4) is parameterized by ψ: gψ. Our analysis in Section 2.2 already suggests an algorithm. Namely, we can optimize θ by minimizing D (gψ pθ, gψ qθ) while optimizing ψ by (4). This adversarial game is analogous to the setup of the existing distribution alignment algorithms3. In practice, rather than having direct access to pθ, qθ, which is unavailable, we are often given i.i.d. samples {xp i }N i=1, {xq i }M i=1. They form discrete distributions ˆpθ(x) = 1 N PN i=1 δ(x xp i ), ˆqθ(x) = 1 M PM i=1 δ(x xq i ), and [gψ ˆpθ](t) = 1 N PN i=1 δ(t gψ(xp i )), [gψ ˆqθ](t) = 1 M PM i=1 δ(t gψ(xq i )). Since gψ ˆpθ and gψ ˆqθ are discrete distributions, they have supports {gψ(xp i )}N i=1 and {gψ(xq i )}M i=1 respectively. SSD divergence between discrete distributions gψ ˆpθ and gψ ˆqθ is D (gψ ˆpθ, gψ ˆqθ) = 1 i=1 d gψ(xp i ), {gψ(xq j)}M j=1 + 1 i=1 d gψ(xq i ), {gψ(xp j)}N j=1 . (5) Effect of mini-batch training. When training on large datasets, we need to rely on stochastic optimization with mini-batches. We denote the mini-batches (of same size, as in common practice) from pθ and qθ as xp = {xp i }m i=1 and xq = {xq i }m i=1 respectively. By minimizing D (gψ(xp), gψ(xq)), we only consider the mini-batch support distance rather than the population support distance (5). We observe that in practice the described algorithm brings the distributions to a state closer to distribution alignment rather than support alignment (see Appendix D.5 for details). The problem is in the typically small batch size. The algorithm actually tries to enforce support alignment for all possible pairs of mini-batches, which is a much stricter constraint than population support alignment. To address the issue mentioned above, without working with a much larger batch size, we create two history buffers : hp, storing the previous 1D discriminator outputs of (at most) n samples from pθ, and a similar buffer hq for qθ. Specifically, h = {gψold,i(xold,i)}n i=1 stores the values of the previous n samples xold,i mapped by their corresponding past versions of the discriminator gψold,i. We minimize D (vp, vq), where vp = concat(hp, gψ(xp)), vq = concat(hq, gψ(xq)): D (vp, vq) = 1 n + m i=1 d(vp i , vq) + j=1 d(vq j, vp) Note that D ( , ) between two sets of 1D samples can be efficiently calculated since d(vp i , vq) and d(vq j, vp) are simply 1-nearest neighbor distances in 1D. Moreover the history buffers store only the scalar values from the previous batches. These values are only considered in nearest neighbor assignment but do not directly provide gradient signal for optimization. Thus, the computation overhead of including a long history buffer is very light. We present our full algorithm, Adversarial Support Alignment (ASA), in Algorithm 1. 4 SPECTRUM OF NOTIONS OF ALIGNMENT In this section, we take a closer look into our work and different existing notions of alignment that have been proposed in the literature, especially their formulations from the optimal transport perspective. 
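Before turning to these connections, the following PyTorch sketch spells out the mini-batch alignment loss (6) with history buffers, as used in steps 4-6 of Algorithm 1 below. The class and function names (HistoryBuffer, asa_alignment_loss) are ours rather than those of the released implementation, and the explicit pairwise-distance matrix stands in for the more efficient sorted 1D nearest-neighbor search.

```python
import torch

class HistoryBuffer:
    """FIFO buffer of past 1D discriminator outputs (stored detached)."""
    def __init__(self, max_size=1000):
        self.max_size = max_size
        self.values = torch.empty(0)

    def update(self, new_values):
        self.values = torch.cat(
            [self.values, new_values.detach().cpu()])[-self.max_size:]

def asa_alignment_loss(g_p, g_q, buf_p, buf_q, dist="abs"):
    # v^p, v^q: concatenate buffered outputs (constants) with current outputs (grad);
    # buffered values enter the nearest-neighbor assignment but carry no gradient
    v_p = torch.cat([buf_p.values.to(g_p.device), g_p])
    v_q = torch.cat([buf_q.values.to(g_q.device), g_q])
    diff = v_p[:, None] - v_q[None, :]                 # pairwise 1D differences
    d = diff.abs() if dist == "abs" else diff.pow(2)   # ASA-abs / ASA-sq
    # one nearest neighbor per point, in both directions, as in eq. (6)
    loss = d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
    buf_p.update(g_p)
    buf_q.update(g_q)
    return loss
```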
We show that our proposed support alignment framework is a limit of existing notions of alignment, both in terms of theory and algorithm, by increasing transportation assignment tolerance. 4.1 THEORETICAL CONNECTIONS Distribution alignment. Wasserstein distance is a commonly used objective for distribution alignment. In our analysis, we focus on the Wasserstein-1 distance: DW (p, q) = inf γ Γ(p,q) E(x,y) γ[d(x, y)], (7) 3Following existing adversarial distribution alignment methods, e.g. (Goodfellow et al., 2014), we use single update of ψ per 1 update of θ. While theoretical analysis for both distribution alignment and support alignment (ours) assume optimal discriminators, training with single update of ψ is computationally cheap and effective. Published as a conference paper at ICLR 2022 Algorithm 1 Our proposed ASA algorithm. n (maximum history buffer size), we use n = 1000. 1: for number of training steps do 2: Sample mini-batches {xp i }m i=1 pθ, {xq i }m i=1 qθ. 3: Perform optimization step on ψ using stochastic gradient h log(1 + exp( gψ(xp i ))) + log(1 + exp(gψ(xq i ))) i . 4: vp concat(hp, {gψ(xp i )}m i=1), vq concat(hq, {gψ(xq i )}m i=1). 5: πi p q arg minj d(vp i , vq j), πj q p arg mini d(vp i , vq j). 6: Perform optimization step on θ using stochastic gradient h d(vp i , vq πip q) + d(vq i , vp πiq p) i . 7: UPDATEHISTORY(hp, {gψ(xp i )}m i=1), UPDATEHISTORY(hq, {gψ(xq i )}m i=1). 8: end for where Γ(p, q) is the set of all measures on X X with marginals of p and q, respectively. The value of DW (p, q) is the minimal transportation cost for transporting probability mass from p to q. The transportation cost is zero if and only if p = q, meaning the distributions are aligned. Relaxed distribution alignment. Wu et al. (2019b) proposed a modified Wasserstein distance to achieve asymmetrically-relaxed distribution alignment, namely β-admissible Wasserstein distance: Dβ W (p, q) = inf γ Γβ(p,q) E(x,y) γ[d(x, y)], (8) where Γβ(p, q) is the set of all measures γ on X X such that R γ(x, y)dy = p(x), x and R γ(x, y)dx (1 + β)q(y), y. With the relaxed marginal constraints, one could choose a transportation plan γ which transports probability mass from p to a modified distribution q rather than the original distribution q as long as q satisfies the constraint q (x) (1 + β)q(x), x. Therefore, Dβ W (p, q) is zero if and only if p(x) (1 + β)q(x), x. In (Wu et al., 2019b), β is normally set to a positive finite number to achieve the asymmetric-relaxation of distribution alignment, and it is shown that D0 W (p, q) = DW (p, q). We can extend Dβ W (p, q) to a symmetric version, which we term β1, β2-admissible Wasserstein distance: Dβ1,β2 W (p, q) = Dβ1 W (p, q) + Dβ2 W (q, p). (9) The aforementioned property of β-admissible Wasserstein distance implies that Dβ1,β2 W (p, q) = 0 if and only if p(x) (1 + β1)q(x), x and q(x) (1 + β2)p(x), x, in which case we call p and q (β1, β2)-aligned , with β1 and β2 controlling the transportation assignment tolerances. Support alignment. The term Ep[d(x, supp(q))] in (1) represents the average distance from samples in p to the support of q. From the optimal transport perspective, this value is the minimal transportation cost of transporting the probability mass of p into the support of q. We show that SSD divergence can be considered as a transportation cost in the limit of infinite assignment tolerance. Proposition 4.1. D , W (p, q) := limβ1,β2 Dβ1,β2 W (p, q) = D (p, q). 
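Proposition 4.1 can be checked numerically on small sets of 1D samples (the empirical counterparts of these objectives are developed in Section 4.2 below). The sketch below, assuming SciPy's linear-programming solver and with all helper names ours, solves the β-admissible problem as a linear program and shows the cost decreasing toward the one-sided nearest-neighbor cost as β grows; symmetrizing over both directions gives the SSD divergence, as stated in Proposition 4.1.

```python
import numpy as np
from scipy.optimize import linprog

def beta_admissible_cost(op, oq, beta):
    m = len(op)
    cost = np.abs(op[:, None] - oq[None, :])  # d(o^p_i, o^q_j)
    # variables: gamma_{ij}, flattened row-major; each row of gamma sums to 1,
    # each column sums to at most 1 + beta
    a_eq = np.zeros((m, m * m))
    for i in range(m):
        a_eq[i, i * m:(i + 1) * m] = 1.0
    a_ub = np.zeros((m, m * m))
    for j in range(m):
        a_ub[j, j::m] = 1.0
    res = linprog(cost.ravel() / m,
                  A_eq=a_eq, b_eq=np.ones(m),
                  A_ub=a_ub, b_ub=(1.0 + beta) * np.ones(m),
                  bounds=(0, None), method="highs")
    return res.fun

rng = np.random.default_rng(0)
op, oq = rng.normal(0.0, 1.0, 6), rng.normal(0.5, 1.0, 6)
nn_cost = np.abs(op[:, None] - oq[None, :]).min(axis=1).mean()  # beta -> inf limit
for beta in [0, 1, 4, 100]:
    print(f"beta = {beta:>3}: cost = {beta_admissible_cost(op, oq, beta):.4f}")
print(f"one-sided nearest-neighbor cost (limit): {nn_cost:.4f}")
```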
We now have completed the spectrum of alignment objectives defined within the optimal transport framework. The following proposition establishes the relationship within the spectrum. Proposition 4.2. Let p and q be two distributions in P. Then, 1. DW (p, q) = 0 implies Dβ1,β2 W (p, q) = 0 for all finite β1, β2 > 0. 2. Dβ1,β2 W (p, q) = 0 for some finite β1, β2 > 0 implies D (p, q) = 0. 3. The converse of statements 1 and 2 are false. In addition to the result presented in Theorem 2.1, we can show that the log-loss discriminator can also preserve the existing notions of alignment. Proposition 4.3. Let f be the optimal discriminator (2) for given distributions p and q. Then, 1. DW (p, q) = 0 iff DW (f p, f q) = 0; 2. Dβ1,β2 W (p, q) = 0 iff Dβ1,β2 W (f p, f q) = 0. Published as a conference paper at ICLR 2022 4.2 ALGORITHMIC CONNECTIONS The result of Proposition 4.3 suggests methods similar to our ASA algorithm presented in Section 3 can achieve different notions of alignment by minimizing objectives discussed in Section 4.1 between the 1D pushforward distributions. We consider the setup used in Section 3 but without history buffers to simplify the analysis, as their usage is orthogonal to our discussion in this section. Recall that we work with a mini-batch setting, where {xp i }m i=1 and {xq i }m i=1 are sampled from p and q respectively, and g is the adversarial log-loss discriminator. We denote the corresponding 1D outputs from the log-loss discriminator by op = {op i }m i=1 = {g(xp i )}m i=1 and oq (defined similarly). Distribution alignment. We adapt (7) for {op i }m i=1 and {oq i }m i=1: DW (op, oq) = inf γ Γ(op,oq) 1 m j=1 γijd(op i , oq j), (10) where Γ(op, oq) is the set of m m doubly stochastic matrices. Since op and oq are sets of 1D samples with the same size, it can be shown (Rabin et al., 2011) that the optimal γ corresponds to an assignment π , which pairs points in the sorting order and can be computed efficiently by sorting both sets op and oq. The transportation cost is zero if and only if there exists an invertible 1-to-1 assignment π such that op i = oq π (i). GAN training algorithms proposed in (Deshpande et al., 2018; 2019) utilize the above sorting procedure to estimate the maximum sliced Wasserstein distance. Relaxed distribution alignment. Similarly, we can adapt (8): Dβ W (op, oq) = inf γ Γβ(op,oq) 1 m j=1 γijd(op i , oq j), (11) where Γβ(op, oq) is the set of m m matrices with non-negative real entries, such that Pm j=1 γij = 1, i and Pm i=1 γij 1 + β, j. The optimization goal in (11) is to find a soft-assignment γ which describes the transportation of probability mass from points op i in op to points oq i in oq. The parameter β controls the set of admissible assignments Γβ, which is similar to its role discussed in Section 4.1: with transportation assignment tolerance β, the total mass of points in op transported to each of the points oq i cannot exceed 1 + β. We refer to such assignments as (β + 1)-to-1 assignment. The transportation cost is zero if and only if there exists such an assignment between op and oq. It can be shown (see Appendix C) that for integer value of β, the set of minimizers of (11) must contain a hard-assignment transportation plan, which assigns each point op i to exactly one point oq j. Then (1 + β) gives the upper bound on the number of points op i that can be transported to given point oq j. 
This hard assignment problem can be solved quasi-linearly with worst case time complexity O (β + 1)m2 (Bonneel & Coeurjolly, 2019), which, combined with Proposition 4.3, can lead to new algorithms for relaxed distribution alignment besides those proposed in Wu et al. (2019b). Support alignment. When β = , the sum Pm i=1 γij is unconstrained for all j, and each point op i can be assigned to any of the points oq j. The optimal solution is simply 1-nearest neighbor assignment, or to follow the above terminology, -to-1 assignment. 5 EXPERIMENTS Problem setting. We evaluate our proposed ASA method in the setting of unsupervised domain adaptation (UDA). The goal of UDA algorithms is to train and adapt a classification model M : X Y from source domain distribution p X,Y to target domain distribution q X,Y given the access to a labeled source dataset {xp i , yp i }Np i=1 p X,Y and an unlabeled target dataset {xq i }Nq i=1 q X. A common approach for UDA is to represent M as Cϕ F θ: a classifier Cϕ : Z Y and a feature extractor F θ : X Z, and train Cϕ and F θ by minimizing: 1) classification loss ℓcls on source examples; 2) alignment loss Dalign measuring discrepancy between pθ Z = F θ p X and qθ Z = F θ q X: min ϕ,θ 1 N p i=1 ℓcls(Cϕ(F θ(xp i )), yp i ) + λ Dalign {F θ(xp i )}Np i=1, {F θ(xq i )}Nq i=1 , (12) Published as a conference paper at ICLR 2022 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.5 3 (a) No DA (avg acc: 63%) DW (pθ Z, qθ Z) = 0.78 D (pθ Z, qθ Z) = 0.10 2 1 0 1 2 2 (b) DANN (avg acc: 75%) DW (pθ Z, qθ Z) = 0.07 D (pθ Z, qθ Z) = 0.02 Target Source class 3 class 5 class 9 class 3 class 5 class 9 (c) ASA-abs (avg acc: 94%) DW (pθ Z, qθ Z) = 0.59 D (pθ Z, qθ Z) = 0.03 Figure 2: Visualization of learned 2D embeddings on 3-class USPS MNIST with label distribution shift. In source domain, all classes have equal probability 1 3. The target probabilities of classes 3 , 5 , 9 are [23%, 65%, 12%]. Each panel shows 2 level sets (outer one approximates the support) of the kernel density estimates of embeddings in source (filled regions) and target domains (solid/dashed lines). We report the average class accuracy of the target domain, DW and D between embeddings. In practice Dalign is an estimate of a divergence measure via an adversarial discriminator gψ. Choices of Dalign include f-divergences (Ganin et al., 2016; Nowozin et al., 2016) and Wasserstein distance (Arjovsky et al., 2017) to enforce distribution alignment and versions of re-weighted/relaxed distribution divergences (Wu et al., 2019b; Tachet des Combes et al., 2020) to enforce relaxed distribution alignment. For support alignment, we apply the proposed ASA method as the alignment subroutine in (12) with log-loss discriminator gψ (4) and Dalign computed as (6). Task specifications. We consider 3 UDA tasks: USPS MNIST, STL CIFAR, and Vis DA-2017, and 2 versions of ASA: ASA-sq, ASA-abs corresponding to squared and absolute distances respectively for d( , ) in (6). We compare ASA with: No DA (no domain adaptation), DANN (Ganin et al., 2016) (distribution alignment with JS divergence), VADA (Shu et al., 2018) (distribution alignment with virtual adversarial training), IWDAN, IWCDAN (Tachet des Combes et al., 2020) (relaxed distribution alignment via importance weighting) s DANN-β (Wu et al., 2019b) (relaxed/β-admissible JS divergence via re-weighting). Please refer to Appendix D for full experimental details. 
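To make the training loop concrete, a schematic ASA training step for objective (12) is sketched below in PyTorch. It reuses the asa_alignment_loss helper sketched earlier; all module, optimizer, and argument names are placeholders rather than the released implementation, and a single discriminator update per model update is used, as in the adversarial setup described above.

```python
import torch
import torch.nn.functional as F

def training_step(feat, clf, disc, opt_model, opt_disc,
                  x_src, y_src, x_tgt, buf_p, buf_q, lam=1.0):
    z_src, z_tgt = feat(x_src), feat(x_tgt)

    # 1) discriminator update: log-loss objective (4) on detached features
    g_p, g_q = disc(z_src.detach()), disc(z_tgt.detach())
    d_loss = F.softplus(-g_p).mean() + F.softplus(g_q).mean()
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 2) feature extractor + classifier update:
    #    source classification loss + lambda * 1D support-alignment loss (6)
    cls_loss = F.cross_entropy(clf(z_src), y_src)
    align_loss = asa_alignment_loss(disc(z_src).squeeze(-1),
                                    disc(z_tgt).squeeze(-1),
                                    buf_p, buf_q, dist="abs")
    loss = cls_loss + lam * align_loss
    opt_model.zero_grad(); loss.backward(); opt_model.step()
    return cls_loss.item(), align_loss.item()
```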
To evaluate the robustness of the methods, we simulate label distribution shift by subsampling the source and target datasets so that the source has a balanced label distribution and the target label distribution follows the power law q_Y(y) ∝ σ(y)^(-α), where σ is a random permutation of the class labels {1, . . . , K} and α controls the severity of the shift (α = 0 means a balanced label distribution). For each task, we generate 5 random permutations σ for 4 different shift levels α ∈ {0, 1, 1.5, 2}. Essentially, we transform each (source, target) dataset pair into 5 × 4 = 20 tasks of different difficulty levels, since classes are not equally difficult and different permutations can give them different weights.

Evaluation metrics. We choose the average (per-)class accuracy and minimum (per-)class accuracy on the target test set as evaluation metrics. Under the average class accuracy metric, all classes are treated as equally important (despite the unequal representation during training for α > 0), and the minimum class accuracy focuses on the model's worst within-class performance. In order to account for the variability of task difficulties across random permutations of target labels, we report robust statistics, the median and the 25-75 percentile interval, across 5 runs.

Illustrative example. First, we consider a simplified setting to intuitively understand and directly analyze the behavior of our proposed support alignment method in domain adaptation under label distribution shift. We consider a 3-class USPS→MNIST problem by selecting a subset of examples corresponding to digits "3", "5", and "9", and use a feature extractor network with a 2D output space. We introduce label distribution shift as described above with α = 1.5, i.e., the probabilities of the classes in the target domain are 12%, 23%, and 65%. We compare No DA, DANN, and ASA-abs by their average target classification accuracy, Wasserstein distance D_W(p^θ_Z, q^θ_Z), and SSD divergence D_∆(p^θ_Z, q^θ_Z) between the learned embeddings of the source and target domains. We apply a global affine transformation to each embedding space in order to have comparable distances between different spaces: we center the embeddings so that their average is 0 and re-scale them so that their average norm is 1. The
Table 1: Average and minimum class accuracy (%) on USPS→MNIST with different levels of shift in label distributions (higher α implies more severe imbalance). We report the median (the main number), and the 25th (subscript) and 75th (superscript) percentiles across 5 runs.
α = 0.0 α = 1.0 α = 1.5 α = 2.0 Algorithm average min average min average min average min No DA 71.9 72.9 70.4 20.3 22.9 17.6 72.9 74.7 72.0 25.8 31.8 18.3 71.3 72.5 71.2 27.5 37.3 24.2 71.3 73.0 70.6 16.6 26.8 10.8 DANN 97.8 97.8 97.6 96.0 96.1 95.8 83.5 84.6 76.7 25.1 36.9 08.4 70.0 71.2 63.9 01.1 01.5 01.0 57.8 60.4 52.0 00.9 01.6 00.5 VADA 98.0 98.0 97.9 96.2 96.3 95.9 88.2 89.9 88.1 48.9 50.0 47.8 78.2 83.1 70.7 06.6 23.5 02.4 61.9 65.4 56.3 01.4 01.5 00.8 IWDAN 97.5 97.5 97.4 95.7 95.9 95.7 95.7 95.8 92.6 81.3 82.3 67.1 86.5 87.8 80.2 15.2 55.0 04.2 74.4 78.6 70.0 07.3 22.4 06.3 IWCDAN 98.0 98.1 97.9 96.6 96.9 96.4 96.7 97.5 93.3 85.1 93.9 65.3 91.3 93.8 90.5 66.5 74.5 64.1 77.5 82.3 77.3 22.2 45.4 02.7 s DANN-4 87.4 95.7 87.2 05.6 90.0 05.6 94.9 94.9 94.7 85.7 87.7 84.4 86.8 89.1 85.5 21.6 50.3 15.4 81.5 83.1 81.3 39.3 56.2 37.9 ASA-sq 93.7 93.9 93.3 89.2 89.4 88.4 92.3 93.6 91.5 83.5 88.7 80.8 90.9 92.1 89.6 69.9 82.0 66.6 87.2 89.3 85.8 62.5 69.3 46.4 ASA-abs 94.1 94.5 93.8 88.9 91.2 87.0 92.8 93.2 89.3 78.9 82.9 65.1 92.5 92.9 90.9 82.4 85.4 74.5 90.4 90.7 89.2 68.4 73.0 67.5 Table 2: Results on STL CIFAR. Same setup and reporting metrics as Table 1. α = 0.0 α = 1.0 α = 1.5 α = 2.0 Algorithm average min average min average min average min No DA 69.9 70.0 69.8 49.8 50.6 45.3 68.8 69.3 68.3 47.2 48.2 45.3 66.8 67.2 66.4 46.0 47.0 45.8 65.8 66.7 64.8 43.7 44.6 41.6 DANN 75.3 75.4 74.9 54.6 56.6 54.2 69.9 70.1 68.6 44.8 45.1 40.7 64.9 67.1 63.7 34.9 36.8 33.9 63.3 64.8 57.4 27.0 28.5 21.2 VADA 76.7 76.7 76.6 56.9 58.3 53.5 70.6 71.0 70.0 47.7 48.8 44.0 66.1 66.5 65.4 35.7 39.3 33.3 63.2 64.7 60.2 25.5 28.0 25.2 IWDAN 69.9 70.7 69.9 50.5 50.6 47.9 68.7 69.1 68.6 45.8 50.5 44.8 67.1 67.3 65.9 44.7 44.8 40.4 64.4 64.9 63.6 36.8 37.9 34.5 IWCDAN 70.1 70.2 70.1 47.8 49.3 42.4 69.4 69.4 69.1 47.1 51.3 46.3 66.1 67.2 65.0 39.9 40.8 37.7 64.5 65.1 63.9 37.0 40.2 35.5 s DANN-4 71.8 72.1 71.7 52.1 52.8 52.1 71.1 71.7 70.4 49.9 51.8 48.1 69.4 70.0 68.7 48.6 49.0 43.5 66.4 67.9 66.2 39.0 47.1 33.6 ASA-sq 71.7 71.9 71.7 52.9 53.4 46.7 70.7 71.0 70.4 51.6 52.7 46.8 69.2 69.3 69.2 45.6 52.0 43.3 68.1 68.2 67.2 44.7 45.9 39.8 ASA-abs 71.6 71.7 71.2 49.0 53.5 48.4 70.9 71.0 70.8 49.2 50.0 47.3 69.6 69.9 69.6 43.2 49.5 42.1 67.8 68.2 66.6 40.9 49.0 35.4 results are shown in Figure 2 and Table D.5. Compared to No DA, both DANN and ASA achieve support alignment. DANN enforces distribution alignment, and thus places some target embeddings into regions corresponding to the wrong class. In comparison, ASA does not enforce distribution alignment and maintains good class correspondence across the source and target embeddings. Main results. The results of the main experimental evaluations are shown in Tables 1, 2, 3. Without any alignment, source only training struggles relatively to adapt to the target domain. Nonetheless, its performance across the imbalance levels remains robust, since the training procedure is the same. Agreeing with the observation and theoretical results from previous work (Zhao et al., 2019; Li et al., 2020; Tan et al., 2020; Wu et al., 2019b; Tachet des Combes et al., 2020), distribution alignment methods (DANN and VADA) perform well when there is no shift but suffer otherwise, whereas relaxed distribution alignment methods (IWDAN, IWCDAN and s DANN-β) show more resilience to shifts. On all tasks with positive α, we observe that it is common for the existing methods to achieve good class average accuracies while suffering significantly on some individual classes. 
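For reference, the two metrics reported in these tables can be computed from target-test predictions as follows (a minimal sketch, assuming NumPy; all names are ours).

```python
import numpy as np

def per_class_accuracies(y_true, y_pred, num_classes):
    # accuracy computed separately within each class present in the test set
    accs = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():
            accs.append((y_pred[mask] == c).mean())
    return np.array(accs)

# toy usage: three classes, class 0 is classified poorly
y_true = np.array([0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 2, 0, 1, 1, 2, 2])
accs = per_class_accuracies(y_true, y_pred, num_classes=3)
print("average class accuracy:", accs.mean())  # treats all classes equally
print("minimum class accuracy:", accs.min())   # worst within-class performance
```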
These results suggest that the often-ignored but important min-accuracy metric can be very challenging. Finally, our support alignment methods (ASA-sq and ASA-abs) are the most robust ones against the shifts, while still being competitive in the more balanced settings (α = 0 or 1). We achieve best results in the more imbalanced and difficult tasks (α = 1.5 or 2) for almost all categories on all datasets. Please refer to Appendix D for ablation studies and additional comparisons. 6 RELATED WORK Distribution alignment. Apart from the works, e.g. (Ajakan et al., 2014; Ganin et al., 2016; Ganin & Lempitsky, 2015; Pei et al., 2018; Zhao et al., 2018; Long et al., 2018; Tachet des Combes et al., Published as a conference paper at ICLR 2022 Table 3: Results on Vis DA17. Same setup and reporting metrics as Table 1. α = 0.0 α = 1.0 α = 1.5 α = 2.0 Algorithm average min average min average min average min No DA 49.5 50.5 49.4 22.2 24.6 22.2 50.2 50.8 49.2 21.2 21.3 20.7 47.1 47.6 46.6 18.6 22.2 18.6 45.3 46.5 45.2 19.5 19.8 14.4 DANN 75.4 76.2 74.4 36.7 40.9 35.6 64.1 65.3 62.8 25.0 29.3 24.8 52.1 52.3 51.4 11.5 12.4 11.4 43.1 44.3 39.1 03.6 14.3 03.6 VADA 75.3 76.0 74.8 40.5 41.8 39.7 64.6 65.1 61.2 22.8 28.2 21.7 53.0 54.2 51.6 14.8 21.7 13.7 43.9 44.7 40.9 08.5 11.1 05.0 IWDAN 73.2 73.3 72.9 31.7 34.8 22.8 64.4 64.6 61.1 12.1 24.7 05.0 51.3 56.6 51.0 04.6 10.4 02.1 45.1 48.0 41.7 04.6 13.6 01.2 IWCDAN 71.6 75.2 70.6 27.6 28.0 22.8 60.6 61.0 60.2 01.1 11.3 00.7 49.7 51.9 45.6 02.2 05.7 00.2 38.3 46.2 37.3 00.6 01.7 00.3 s DANN-4 72.4 73.3 71.8 37.8 40.8 32.3 68.4 68.7 66.2 26.6 29.4 26.2 57.2 57.8 56.8 18.6 23.9 16.7 50.7 51.7 49.8 18.6 20.0 17.1 ASA-sq 64.9 65.0 63.7 35.7 35.8 32.1 61.8 63.2 60.6 31.4 34.4 20.4 57.8 58.3 55.5 26.7 32.1 17.3 51.9 52.0 50.8 18.3 21.2 16.9 ASA-abs 64.8 65.0 64.5 40.6 41.9 36.0 62.0 62.3 60.5 27.3 29.7 16.7 57.1 58.4 56.2 26.0 31.2 13.9 52.5 56.6 51.9 19.7 22.2 17.7 2020; Li et al., 2018b; Tzeng et al., 2017; Shen et al., 2018; Kumar et al., 2018; Li et al., 2018a; Wang et al., 2021; Goodfellow et al., 2014; Arjovsky et al., 2017; Gulrajani et al., 2017; Mao et al., 2017; Radford et al., 2015; Salimans et al., 2018; Genevay et al., 2018; Wu et al., 2019a; Deshpande et al., 2018; 2019), that do distribution alignment, there are also papers (Long et al., 2015; 2017; Peng et al., 2019; Sun et al., 2016; Sun & Saenko, 2016) focusing on aligning some characteristics of the distribution, such as first or second moments. Our work is concerned with a different problem, support alignment, which is a novel objective in this line of work. In terms of methodology, our use of the discriminator output space to work with easier optimization in 1D is inspired by a line of work (Salimans et al., 2018; Genevay et al., 2018; Wu et al., 2019a; Deshpande et al., 2018; 2019) on sliced Wasserstein distance based models. Our result in Proposition 4.3 also provides theoretical insight on the practical effectiveness of 1D OT in (Deshpande et al., 2019). Relaxed distribution alignment. In Section 4, we have already covered in detail the connections between our work and (Wu et al., 2019b). Balaji et al. (2020) introduced relaxed distribution alignment with a different focus, aiming to be insensitive to outliers. Chamfer distance/divergence (CD) is used to compute similarity between images/3D point clouds (Fan et al., 2017; Nguyen et al., 2021). For text data, Kusner et al. (2015) presented Relaxed Word Mover s Distance (RWMD) to prune candidates of similar documents. 
CD and RWMD are essentially the same as (5) with d( , ) being the Euclidean distance. They are computed by finding the nearest neighbor assignments. Our subroutine of calculating the support distance in the 1D discriminator output space is done similarly by finding nearest neighbors within the current batch and history buffers. Support estimation. There exists a series of work, e.g. (Schölkopf et al., 2001; Hoffmann, 2007; Tax & Duin, 2004; Knorr et al., 2000; Chalapathy et al., 2017; Ruff et al., 2018; Perera et al., 2019; Deecke et al., 2018; Zenati et al., 2018), on novelty/anomaly detection problem, which can be casted as support estimation. We consider a fundamentally different problem setting. Our goal is to align the supports and our approach does not directly estimate the supports. Instead, we implicitly learn the relationships between supports (density ratio to be specific) via a discriminator. 7 CONCLUSION AND FUTURE WORK In this paper, we studied the problem of aligning the supports of distributions. We formalized its theoretical connections with existing alignment notions and demonstrated the effectiveness of the approach in domain adaptation. We believe that our methodology opens possibilities for the design of more nuanced and structured alignment constraints, suitable for various use cases. One natural extension is support containment, achievable with only one term in (1). This approach is fitting for partial domain adaptation, where some source domain classes do not appear in the target domain. Another interesting direction is unsupervised domain transfer, where support alignment is more desired than existing distribution alignment methods due to mode imbalance (Binkowski et al., 2019). Published as a conference paper at ICLR 2022 ACKNOWLEDGMENTS The computational experiments presented in this paper were performed on Satori cluster developed as a collaboration between MIT and IBM. We used Weights & Biases (Biewald, 2020) for experiment tracking and visualizations to develop insights for this paper. TJ acknowledges support from MIT-IBM Watson AI Lab, from Singapore DSO, and MIT-DSTA Singapore collaboration. We thank Xiang Fu and all anonymous reviewers for their helpful comments regarding the paper s writing and presentation. Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, and Mario Marchand. Domainadversarial neural networks. ar Xiv preprint ar Xiv:1412.4446, 2014. 1, 8 Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214 223. PMLR, 2017. 3, 7, 9 Yogesh Balaji, Rama Chellappa, and Soheil Feizi. Robust optimal transport with applications in generative modeling and domain adaptation. Advances in Neural Information Processing Systems Foundation (Neur IPS), 2020. 9 Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19:137, 2007. 1 Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1):151 175, 2010. 1 Dimitris Bertsimas and John N Tsitsiklis. Introduction to linear optimization, volume 6. Athena Scientific Belmont, MA, 1997. 21 Lukas Biewald. Experiment tracking with weights and biases, 2020. URL https://www.wandb. com/. Software available from wandb.com. 10 Mikolaj Binkowski, Devon Hjelm, and Aaron Courville. 
Batch weight for domain adaptation with mass shift. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1844 1853, 2019. 9 Nicolas Bonneel and David Coeurjolly. Spot: sliced partial optimal transport. ACM Transactions on Graphics (TOG), 38(4):1 13, 2019. 6 Eric Budish, Yeon-Koo Che, Fuhito Kojima, and Paul Milgrom. Implementing random assignments: A generalization of the birkhoff-von neumann theorem. In 2009 Cowles Summer Conference, 2009. 21 Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. Robust, deep and inductive anomaly detection. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 36 51. Springer, 2017. 9 Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215 223. JMLR Workshop and Conference Proceedings, 2011. 22 Lucas Deecke, Robert Vandermeulen, Lukas Ruff, Stephan Mandt, and Marius Kloft. Image anomaly detection with generative adversarial networks. In Joint european conference on machine learning and knowledge discovery in databases, pp. 3 17. Springer, 2018. 9 Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced wasserstein distance. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3483 3491, 2018. 6, 9, 20 Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander G Schwing. Max-sliced wasserstein distance and its use for gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10648 10656, 2019. 6, 9, 20, 23 Published as a conference paper at ICLR 2022 Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605 613, 2017. 3, 9 Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pp. 1180 1189. PMLR, 2015. 1, 8 Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096 2030, 2016. 1, 7, 8 Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pp. 1608 1617. PMLR, 2018. 9 Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. ar Xiv preprint ar Xiv:1406.2661, 2014. 3, 4, 9 Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5769 5779, 2017. 3, 9, 23 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. 25 Heiko Hoffmann. Kernel pca for novelty detection. Pattern recognition, 40(3):863 874, 2007. 9 Jonathan J. Hull. A database for handwritten text recognition research. 
IEEE Transactions on pattern analysis and machine intelligence, 16(5):550 554, 1994. 22 Fredrik D Johansson, David Sontag, and Rajesh Ranganath. Support and invertibility in domaininvariant representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 527 536. PMLR, 2019. 1 Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. 22 Edwin M Knorr, Raymond T Ng, and Vladimir Tucakov. Distance-based outliers: algorithms and applications. The VLDB Journal, 8(3):237 253, 2000. 9 Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. 22 Abhishek Kumar, Prasanna Sattigeri, Kahini Wadhawan, Leonid Karlinsky, Rogerio Feris, Bill Freeman, and Gregory Wornell. Co-regularized alignment for unsupervised domain adaptation. Advances in Neural Information Processing Systems, 31:9345 9356, 2018. 1, 9 Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International conference on machine learning, pp. 957 966. PMLR, 2015. 3, 9 Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. 22 Bo Li, Yezhen Wang, Tong Che, Shanghang Zhang, Sicheng Zhao, Pengfei Xu, Wei Zhou, Yoshua Bengio, and Kurt Keutzer. Rethinking distributional matching based domain adaptation. ar Xiv preprint ar Xiv:2006.13352, 2020. 1, 8 Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5400 5409, 2018a. 1, 9 Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 624 639, 2018b. 9 Published as a conference paper at ICLR 2022 Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International conference on machine learning, pp. 97 105. PMLR, 2015. 9 Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International conference on machine learning, pp. 2208 2217. PMLR, 2017. 9 Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640 1650, 2018. 8 Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2794 2802, 2017. 9 Trung Nguyen, Quang-Hieu Pham, Tam Le, Tung Pham, Nhat Ho, and Binh-Son Hua. Point-set distances for learning representations of 3d point clouds. ar Xiv preprint ar Xiv:2102.04014, 2021. 3, 9 Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 271 279, 2016. 7 Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. 1, 8 Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. 
Visda: The visual domain adaptation challenge, 2017. 25 Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1406 1415, 2019. 9 Pramuditha Perera, Ramesh Nallapati, and Bing Xiang. Ocgan: One-class novelty detection using gans with constrained latent representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2898 2906, 2019. 9 Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. 21 Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435 446. Springer, 2011. 6, 20 Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ar Xiv preprint ar Xiv:1511.06434, 2015. 9 Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In International conference on machine learning, pp. 4393 4402. PMLR, 2018. 9 Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. In International Conference on Learning Representations, 2018. URL https:// openreview.net/forum?id=rk Qk Bn JAb. 9 Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443 1471, 2001. 9 Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. 1, 9 Published as a conference paper at ICLR 2022 Rui Shu, Hung Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-t approach to unsupervised domain adaptation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1q-TM-AW. 7, 22, 24 Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European conference on computer vision, pp. 443 450. Springer, 2016. 9 Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016. 9 Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J Gordon. Domain adaptation with conditional distribution matching and generalized label shift. Advances in Neural Information Processing Systems, 33, 2020. 1, 7, 8, 22 Shuhan Tan, Xingchao Peng, and Kate Saenko. Class-imbalanced domain adaptation: An empirical odyssey. In European Conference on Computer Vision, pp. 585 602. Springer, 2020. 1, 8 David MJ Tax and Robert PW Duin. Support vector data description. Machine learning, 54(1):45 66, 2004. 9 Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7167 7176, 2017. 1, 9 Jing Wang, Jiahong Chen, Jianzhe Lin, Leonid Sigal, and Clarence W de Silva. Discriminative feature alignment: Improving transferability of unsupervised domain adaptation by gaussian-guided latent alignment. 
Pattern Recognition, 116:107943, 2021. 1, 9 Jiqing Wu, Zhiwu Huang, Dinesh Acharya, Wen Li, Janine Thoma, Danda Pani Paudel, and Luc Van Gool. Sliced wasserstein generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019a. 9, 20 Yifan Wu, Ezra Winston, Divyansh Kaushik, and Zachary Lipton. Domain adaptation with asymmetrically-relaxed distribution alignment. In International Conference on Machine Learning, pp. 6872 6881. PMLR, 2019b. 1, 5, 6, 7, 8, 9, 23 Houssam Zenati, Manon Romain, Chuan-Sheng Foo, Bruno Lecouat, and Vijay Chandrasekhar. Adversarially learned anomaly detection. In 2018 IEEE International conference on data mining (ICDM), pp. 727 736. IEEE, 2018. 9 Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems, pp. 8568 8579, 2018. 1, 8 Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pp. 7523 7532. PMLR, 2019. 1, 8 Published as a conference paper at ICLR 2022 A PROOFS OF THE THEORETICAL RESULTS A.1 PROOF OF PROPOSITION 2.1 1) D (p, q) 0 for all p, q P: Since d( , ) 0, for all p, q, SD(p, q) := Ex p[d(x, supp(q))] = Ex p inf y supp(q) d(x, y) 0, (13) which makes D (p, q) = SD(p, q) + SD(q, p) 0. 2) D (p, q) = 0 if and only if supp(p) = supp(q): With statement 1, D (p, q) = 0 if and only if SD(p, q) = 0 and SD(q, p) = 0. SD(p, q) = 0 = Ex p[d(x, supp(q))] = 0 = p ({x|d(x, supp(q)) > 0}) = 0. This is equivalent to x supp(p), d(x, supp(q)) = 0. Thus, supp(p) supp(q), and similarly, supp(q) supp(p), which makes supp(p) = supp(q). A.2 ASSUMPTION AND PROOF OF THEOREM 2.1 A.2.1 COMMENTS ON ASSUMPTION (3) Assumption (3) is not restrictive. Indeed, distributions satisfying Assumption (3) include: uniform p(x) = U(x; [a, b]); truncated normal; p(x) of the form ( 1 Zp e Ep(x), x supp(p), 0, x / supp(p), with non-negative energy (unnormalized log-density) function Ep : X [0, ); mixture of any distributions satisfying Assumption (3), for instance the distributions shown in Figure A.1 top-left are mixtures of truncated normal distributions on [ 2, 2]. Starting from arbitrary density p0(x) with bounded support we can derive a density p(x) satisfying Assumption (3) via density clipping and re-normalization p(x) clip p0(x), 1 for some C > 1. A.2.2 PROOF OF THEOREM 2.1 First, we show that D (p, q) = 0 implies D (f p, f q) = 0. D (p, q) = 0 implies supp(p) = supp(q). Then for any mapping f : X R, we have supp(f p) = supp(f q), which implies supp(f p) = supp(f q). Thus, D (f p, f q) = 0. Now, we prove that D (f p, f q) = 0 implies D (p, q) = 0 by contradiction. Published as a conference paper at ICLR 2022 D (f p, f q) = 0 implies the following: Et f p d(t, supp(f q) = 0, Et f q d(t, supp(f p) = 0. 1) Suppose Ex p[d(x, supp(q))] > 0. This is only possible if p({x | x supp(p) \ supp(q)}) > 0. Since x supp(p) \ supp(q) implies p(x) > 0, q(x) = 0, and for any x supp(p) supp(q), p(x) > 0, q(x) = 0 if and only if f (x) = p(x) p(x)+q(x) = 1, we have: Pf p({1}) = Pp({x | x supp(p) \ supp(q)}) > 0, and therefore 1 supp(f p). For a real number α : 0 < α < 1 C2+1, consider the probability of the event (1 α, 1] [0, 1] under distribution f q: Pf q((1 α, 1]) = Pq({x | f (x) (1 α, 1]}). 
By assumption (3), p(x) < C and q(x) > 0 implies q(x) > 1 C , therefore for x : q(x) > 0 we have f (x) = p(x) p(x) + q(x) < p(x) p(x) + 1 C < C C + 1 C = 1 1 C2 + 1 < 1 α. This means that Pf q((1 α, 1]) = 0, i.e. supp(f q) (1 α, 1] = . To summarize, starting from the assumption that Ex p[d(x, supp(q))] > 0 we showed that 1 supp(f p), Pf p({1}) > 0; supp(f q) (1 α, 1] = . Because D (f p, f q) Et f p d(t, supp(f q)) Pf p({1}) d(1, supp(f q)) Pf p({1}) α > 0, which contradicts with the given D (f p, f q) = 0, we have Ex p[d(x, supp(q))] = 0. 2) Similarly, it can be shown Ex q[d(x, supp(p))] = 0. Thus, D (f p, f q) = 0 implies D (p, q) = 0. A.3 PROOF OF PROPOSITION 2.2 Consider a 1-dimensional Euclidean space R. Let supp(p) = [ 1 2] [1, 2] with p([ 1 4 and p([1, 2]) = 1 4. Let supp(q) = [ 2, 1] [ 1 2] [1, 2] with q([ 2, 1]) = 1 4 and q([1, 2]) = 1 2. The supports of p and q consist of disjoint closed intervals, and we assume uniform distribution within each of these intervals, i.e. p has density p(x) = 3 2]; p(x) = 1 4, x [1, 2] and q has densitiy q(x) = 1 4, x [ 2, 1]; q(x) = 1 2]; q(x) = 1 2, x [1, 2]. Clearly, supp(p) = supp(q). The optimal dual Wasserstein discriminator f W is the maximizer of sup f:Lip(f) 1 Ex p[f(x)] Ey q[f(y)]. Thus, f W is the maximizer of sup f:Lip(f) 1 2 f(x)dx + Z 2 1 f(x)dx Z 1 2 f(x)dx Z 1 2 2 f(x)dx 2 Z 2 which simplifies to sup f:Lip(f) 1 2 f(x)dx + 2 Z 1 2 2 f(x)dx Z 2 Since the optimization objective and the constraint are invariant to replacing the function f(x) with its symmetric reflection g(x) = f( x), if f is a optimal solution, then there exists a symmetric Published as a conference paper at ICLR 2022 maximizer f W (x) = 1 2f ( x), since f W (x) = f W ( x) and Lip(f W ) Lip(f ) 1. Thus, supp(f W p) = supp(f W q) as f W (x) = f W ( x) for x [1, 2]. Note that one can easily extend the above proof to discrete distributions, by replacing the disjoint segments [ 2, 1], [ 1 2], [1, 2] with points { 1}, {0}, {1}. A.4 PROOF OF PROPOSITION 4.1 From (8), we have D W (p, q) := lim β Dβ W (p, q) = lim β inf γ Γβ(p,q) E(x,y) γ[d(x, y)], where limβ Γβ(p, q) is the set of all measures γ on X X such that R γ(x, y)dy = p(x), x and R γ(x, y)dx limβ (1 + β)q(y), y. The set of inequalities Z γ(x, y)dx lim β (1 + β)q(y), y can be simplified to Z γ(x, y)dx = 0, y such that q(y) = 0. To put it together, we have D W (p, q) = inf γ Γ (p,q) E(x,y) γ[d(x, y)], where Γ (p, q) is the set of all measures γ on X X such that R γ(x, y)dy = p(x), x and R γ(x, y)dx = 0, y such that q(y) = 0. In other words, we seek the coupling γ(x, y) which defines a transportation plan such that the total mass transported from given point x is equal to p(x), and the only constraint on the destination points y is that no probability mass can be transported to points y where q(y) = 0, i.e. y / supp(q). Let y (x) denote a function such that y (x) supp(q), x; d(x, y (x)) = inf y supp(q) d(x, y). We can see that γ given by γ (x, y) = p(x)δ(y y (x)), is the optimal coupling. Indeed, γ satisfies the constraints γ Γ , and the cost of any other transportation cost γ Γ is at least that of γ (since y (x) is defined as a closest point y in supp(q) to a given point x). D W (p, q) = inf γ Γ (p,q) E(x,y) γ[d(x, y)] = E(x,y) γ [d(x, y)] = Ex p inf y supp(q) d(x, y) . The last equation implies D W (p, q) = SD(p, q) (SD( , ) is defined in (13)). Then, D , W (p, q) := lim β1,β2 Dβ1,β2 W (p, q) = D W (p, q)+D W (q, p) = SD(p, q)+SD(q, p) = D (p, q). A.5 PROOF OF PROPOSITION 4.2 1. 
DW (p, q) = 0 implies p = q, which is equivalent to p(x) q(x) = 1, x supp(p) supp(q). Then clearly, for all finite β1, β2 > 0 it satisfies 1 1 + β2 p(x) q(x) 1 + β1, x supp(p) supp(q). (14) Thus, Dβ1,β2 W (p, q) = 0 for all finite β1, β2 > 0. Published as a conference paper at ICLR 2022 2. Dβ1,β2 W (p, q) = 0 for some finite β1, β2 > 0 means that (14) is satisfied. This implies that x supp(p), x supp(q) and x supp(q), x supp(p), which makes supp(p) = supp(q). Thus, D (p, q) = 0. 3. The converse of statements 1 and 2 are false: (a) For all finite β1, β2 > 0, let supp(p) = supp(q) = {x1, x2}. Let p(x1) = p(x2) = 1/2 and q(x1) = (1 + β )/2 and q(x2) = (1 β )/2 where β = min β2, 1 1 1 + β1 Then, it can be easily checked that (14) is satisfied, which makes Dβ1,β2 W (p, q) = 0. However, since β = 0, p = q and thus DW (p, q) = 0. (b) Similar to (a), let supp(p) = supp(q) = {x1, x2}. Let p(x1) = q(x2) = ε and p(x2) = q(x1) = 1 ε for some ε > 0. Since supp(p) = supp(q), D (p, q) = 0. However, lim ε 0 p(x1) q(x1) = lim ε 0 ε 1 ε = 0, and, thus, for any finite β2 > 0 we can choose ε > 0 such that p(x1) q(x1) = ε 1 ε < 1 1 + β2 . Therefore, (14) is not satisfied and Dβ1,β2 W (p, q) = 0. A.6 PROOF OF PROPOSITION 4.3 Using (2), we first establish a connection between the pushforward distributions f p and f q. Proposition A.1. Let f be the optimal log-loss discriminator (2) between p and q. Then, [f p](t) [f p](t) + [f q](t) = t, t supp(f p) supp(f q). (15) Proof. For any point t supp(f p) supp(f q), the values of the densities [f p](t) = lim ε 0 Pp ({x | t ε < f (x) < t + ε}) 2ε = lim ε 0 {x|t ε 0, and taking the limit ε 0 we obtain [f p](t) [f p](t) + [f q](t) = t. Figure A.1: Visual illustration of the statement of Proposition A.1. The top-left panel shows two example PDFs p(x), q(x) on closed interval [ 2, 2]. The bottom-left panel shows the optimal discriminator function f (x) = p(x)/(p(x) + q(x)) as a function of x on [ 2, 2]. The top-right panel shows the PDFs [f p](t), [f q](t) of the pushforward distributions f p, f q induced by the discriminator mapping f . f maps [ 2, 2] to [0, 1] and [f p], [f q] are defined on [0, 1]. Consider point x1 [ 2, 2]. The value f (x1) characterizes the ratio of densities p(x1)/(p(x1) + q(x1)) at x1. For another point x2 mapped to the same value f (x2) = f (x1) = t1,2, the ratio of densities p(x2)/(p(x2) + q(x2)) is the same as p(x1)/(p(x1) + q(x1)). All points x mapped to t1,2 share the same ratio of the densities p(x)/(p(x) + q(x)). This fact implies that the ratio of the pushforward densities [f p](t1,2)/([f p](t1,2) + [f q](t1,2)) at t1,2 must be the same as the ratio of densities p(x1)/(p(x1) + q(x1)) = t1,2 at x1 (or x2). The pushforward PDFs [f q](t), [f q](t) satisfy property [f p](t)/([f p](t) + [f q](t)) = t for all t supp(f p) supp(f q). Comment. Intuitively this proposition states the following. If for some x X we have f (x) = t [0, 1], t directly corresponds to the ratio of densities not only in the original space t = p(x)/(p(x) + q(x)), but also in the 1D discriminator output space t = [f p](t)/([f p](t) + [f q](t)). We also provide an intuitive example in Figure A.1. Proof of Proposition 4.3, statement #1. = : If DW (p, q) = 0 then p = q, then f p = f q. Thus, DW (f p, f q) = 0. Published as a conference paper at ICLR 2022 =: If DW (f p, f q) = 0, then f p = f q. Consider probability of event t t > 1 2 under distribution f p. 
Proof of Proposition 4.3, statement #1.
⇒: If D_W(p, q) = 0, then p = q, and hence f*♯p = f*♯q. Thus, D_W(f*♯p, f*♯q) = 0.
⇐: If D_W(f*♯p, f*♯q) = 0, then f*♯p = f*♯q. Consider the probability of the event {t | t > 1/2} under the distribution f*♯p:

P_{f*♯p}({t | t > 1/2}) = ∫ I[f*(x) > 1/2] p(x) dx,

where I[·] is the indicator function (I[c] equals 1 when the condition c is satisfied and 0 otherwise). For all x with p(x) > 0, we have f*(x) = p(x)/(p(x) + q(x)) and p(x) + q(x) > 0. Therefore, the expression above can be rewritten as

∫ I[p(x)/(p(x) + q(x)) > 1/2] p(x) dx = ∫ I[p(x) − q(x) > 0] p(x) dx.

Similarly, the probability of the event {t | t > 1/2} under the distribution f*♯q is

P_{f*♯q}({t | t > 1/2}) = ∫ I[p(x) − q(x) > 0] q(x) dx.

f*♯p = f*♯q implies that

∫ I[p(x) − q(x) > 0] p(x) dx = ∫ I[p(x) − q(x) > 0] q(x) dx,

or, equivalently,

∫ I[p(x) − q(x) > 0] (p(x) − q(x)) dx = 0.

Note that the function I[p(x) − q(x) > 0](p(x) − q(x)) is non-negative for any x. This means that the integral can be zero only if the function is zero everywhere, implying that for any x either I[p(x) − q(x) > 0] = 0 or p(x) − q(x) = 0. In other words, p(x) ≤ q(x) for all x. Using the fact that both densities p(x) and q(x) must integrate to 1, we conclude that p = q and D_W(p, q) = 0.

Proof of Proposition 4.3, statement #2. Note that by (2) and (15), we have

f*(x) = p(x)/(p(x) + q(x)) = t = [f*♯p](t)/([f*♯p](t) + [f*♯q](t)),  ∀x ∈ supp(p) ∪ supp(q).

⇒: Suppose D_W^{β1,β2}(p, q) = 0. Then supp(p) = supp(q) = S (by Proposition 4.2) and supp(f*♯p) = supp(f*♯q) = T (by Theorem 2.1). Moreover, D_W^{β1,β2}(p, q) = 0 implies

1/(1 + β2) ≤ p(x)/q(x) ≤ 1 + β1,  ∀x ∈ S.

Since

f*(x) = p(x)/(p(x) + q(x)) = (p(x)/q(x)) / (1 + p(x)/q(x)),  ∀x ∈ S,

the inequalities above are equivalent to

1/(2 + β2) ≤ f*(x) ≤ (1 + β1)/(2 + β1),  ∀x ∈ S.

Combined with Proposition A.1, the above implies that

1/(2 + β2) ≤ [f*♯p](t)/([f*♯p](t) + [f*♯q](t)) ≤ (1 + β1)/(2 + β1),  ∀t ∈ T,

or, equivalently,

1/(1 + β2) ≤ [f*♯p](t)/[f*♯q](t) ≤ 1 + β1,  ∀t ∈ T.

Therefore, D_W^{β1,β2}(f*♯p, f*♯q) = 0.
⇐: Similarly, when D_W^{β1,β2}(f*♯p, f*♯q) = 0, we have supp(f*♯p) = supp(f*♯q) = T and supp(p) = supp(q) = S. Moreover,

1/(1 + β2) ≤ [f*♯p](t)/[f*♯q](t) ≤ 1 + β1,  ∀t ∈ T,

which is equivalent to

1/(2 + β2) ≤ [f*♯p](t)/([f*♯p](t) + [f*♯q](t)) ≤ (1 + β1)/(2 + β1),  ∀t ∈ T.

By Proposition A.1, this gives

1/(2 + β2) ≤ f*(x) ≤ (1 + β1)/(2 + β1),  ∀x ∈ S,

and hence

1/(1 + β2) ≤ p(x)/q(x) ≤ 1 + β1,  ∀x ∈ S.

Therefore, D_W^{β1,β2}(p, q) = 0.

B A COMMENT ON SLICED SSD DIVERGENCE

Recent works (Deshpande et al., 2018; 2019) have proposed to perform optimal transport (OT)-based distribution alignment by reducing the OT problem (7) in the original, potentially high-dimensional, space to an OT problem between 1D distributions. Specifically, Deshpande et al. (2018) consider the sliced Wasserstein distance (Rabin et al., 2011):

D_SW(p, q) = ∫_{S^{n−1}} D_W(f_θ♯p, f_θ♯q) dθ,  (16)

where S^{n−1} = {θ ∈ R^n | ‖θ‖ = 1} is the unit sphere in R^n and f_θ is the 1D linear projection f_θ(x) = ⟨θ, x⟩. It is known that D_SW is a valid distribution divergence: for any p ≠ q there exists a linear slicing function f_{θ*}, θ* ∈ S^{n−1}, which identifies the difference between the distributions, i.e. f_{θ*}♯p ≠ f_{θ*}♯q (Cramér–Wold theorem). By reducing the original OT problem to one in a 1D space, Wu et al. (2019a) and Deshpande et al. (2019) develop efficient practical methods for distribution alignment based on fast algorithms for the 1D OT problem.
Unfortunately, the straightforward extension of the SSD divergence (1) to a 1D linearly sliced version does not yield a valid support divergence.

Proposition B.1. There exist two distributions p and q in P such that supp(p) ≠ supp(q) but supp(f_θ♯p) = supp(f_θ♯q) for every f_θ(x) = ⟨θ, x⟩ with θ ∈ S^{n−1}.

Proof. Consider the 2-dimensional Euclidean space R² and let supp(p) = {(x, y) | x² + y² ≤ 2} and supp(q) = {(x, y) | 1 ≤ x² + y² ≤ 2}. Then, for every f_θ(x) = ⟨θ, x⟩ with θ ∈ S¹, supp(f_θ♯p) = supp(f_θ♯q) = [−√2, √2]. This counterexample is shown in Figure B.1.
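The counterexample of Proposition B.1 is easy to verify empirically: samples from the disk and from the annulus have (numerically) identical 1D projection supports for every direction θ. The following NumPy sketch is our own illustration, not part of the released code.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_annulus(n, r2_min, r2_max):
        # uniform samples over the region r2_min <= x^2 + y^2 <= r2_max
        r = np.sqrt(rng.uniform(r2_min, r2_max, n))
        a = rng.uniform(0.0, 2.0 * np.pi, n)
        return np.stack([r * np.cos(a), r * np.sin(a)], axis=1)

    p_samples = sample_annulus(200_000, 0.0, 2.0)   # supp(p): disk of radius sqrt(2)
    q_samples = sample_annulus(200_000, 1.0, 2.0)   # supp(q): annulus 1 <= x^2 + y^2 <= 2

    for _ in range(5):
        theta = rng.normal(size=2)
        theta /= np.linalg.norm(theta)              # random direction on S^1
        tp, tq = p_samples @ theta, q_samples @ theta
        print(f"p: [{tp.min():+.3f}, {tp.max():+.3f}]   q: [{tq.min():+.3f}, {tq.max():+.3f}]")
    # both projection ranges approach [-sqrt(2), sqrt(2)] ~ [-1.414, 1.414] for every theta,
    # although supp(p) != supp(q) in R^2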
Figure B.1: Visualization of the example distributions for Proposition B.1.

C DISCUSSION OF SOFT AND HARD ASSIGNMENTS WITH 1D DISCRETE DISTRIBUTIONS

In Section 4.2 we considered the soft-assignment relaxed OT problem (11) and claimed that, for integer β, the set of minimizers of (11) must contain a hard-assignment transportation plan, i.e. a plan with γ_ij ∈ {0, 1} for all i, j. Below we justify this claim.
Note that for β = 0 the OT problem (11) is the standard OT problem for the Wasserstein-1 distance (10), since the inequality constraints Σ_{i=1}^m γ_ij ≤ 1, ∀j, can only be satisfied as equalities. For this problem, it is known (see, e.g., Proposition 2.1 in Peyré et al. (2019)) that the set of optimal soft assignments contains a hard assignment represented by a normalized permutation matrix. This fact can be proven using the Birkhoff–von Neumann theorem, which states that the set of doubly stochastic matrices

{P ∈ R^{n×n} : P_ij ≥ 0 ∀i, j;  Σ_{j=1}^n P_ij = 1 ∀i;  Σ_{i=1}^n P_ij = 1 ∀j}

is exactly the set of all finite convex combinations of permutation matrices. In the context of the linear program (11) with β = 0, the Birkhoff–von Neumann theorem means that all extreme points of the polyhedron Γ_β(o_p, o_q) are hard-assignment matrices. Therefore, by the fundamental theorem of linear programming (Bertsimas & Tsitsiklis, 1997), the minimum of the objective is attained at a hard-assignment matrix.
We argue that a similar result holds for integer β > 0 (see the numerical sketch below). In this case, the matrices in Γ_β(o_p, o_q) can no longer be associated with doubly stochastic matrices, since some of the marginal constraints on γ are relaxed to inequality constraints; because of that, the Birkhoff–von Neumann theorem cannot be applied directly. However, Budish et al. (2009) provide a generalization of the Birkhoff–von Neumann theorem (Theorem 1 in Budish et al. (2009)) which applies to cases where the equality constraints are replaced with integer-valued inequality constraints (recall that we consider integer β). Using this generalized result, our claim can be proven by the following steps. Clearly, the polyhedron Γ_β(o_p, o_q) contains all hard-assignment matrices and all their finite convex combinations. The result proven in Budish et al. (2009) implies that each element of Γ_β(o_p, o_q) can be represented as a finite convex combination of hard-assignment matrices. Thus, the polyhedron Γ_β(o_p, o_q) is exactly the set of all finite convex combinations of hard-assignment matrices, and all extreme points of the polyhedron are hard-assignment matrices. Finally, by analogy with the case β = 0, we invoke the fundamental theorem of linear programming and conclude that the minimum of the objective (11) is attained at a γ corresponding to a hard-assignment matrix.
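The claim can also be checked numerically on a small instance: solving the relaxed linear program with an off-the-shelf LP solver for an integer β returns a vertex whose entries are 0 or 1. The sketch below is our own illustration (the point sets, the value of β, and the use of scipy.optimize.linprog are assumptions, not part of the paper's experiments).

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    m, beta = 6, 2                                   # number of points per side, integer relaxation level
    xs, ys = rng.normal(size=m), rng.normal(size=m)  # 1D source/target points
    cost = np.abs(xs[:, None] - ys[None, :])

    A_eq = np.kron(np.eye(m), np.ones((1, m)))       # row sums: each source point ships mass 1
    A_ub = np.kron(np.ones((1, m)), np.eye(m))       # column sums: each target point receives <= 1 + beta

    res = linprog(cost.ravel(),
                  A_ub=A_ub, b_ub=(1.0 + beta) * np.ones(m),
                  A_eq=A_eq, b_eq=np.ones(m),
                  bounds=(0, None), method="highs")
    gamma = res.x.reshape(m, m)
    print(np.round(gamma, 3))
    # the solver returns a vertex of the feasible polyhedron; for integer beta its entries
    # are 0 or 1, i.e. a hard assignment (up to numerical tolerance)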
D EXPERIMENT DETAILS

D.1 USPS TO MNIST EXPERIMENT SPECIFICATIONS

We use the USPS (Hull, 1994) and MNIST (LeCun et al., 1998) datasets for this adaptation problem. Following Tachet des Combes et al. (2020), we use a LeNet-like (LeCun et al., 1998) architecture for the feature extractor with a 500-dimensional feature representation. The classifier consists of a single linear layer. The discriminator is implemented by a 3-layer MLP with 512 hidden units and leaky-ReLU activations.
We train all methods for 65 000 steps with batch size 64. We train the feature extractor, the classifier, and the discriminator with SGD (learning rate 0.02, momentum 0.9, weight decay 5·10⁻⁴). We perform a single discriminator update per update of the feature extractor and the classifier. After the first 30 000 steps we linearly anneal the feature extractor's and classifier's learning rates for 30 000 steps to the final value 2·10⁻⁵.
The feature extractor's loss is a weighted combination of the cross-entropy classification loss on the labeled source examples and a domain alignment loss computed from the discriminator's signal (recall that different methods use different forms of the alignment loss). The weight for the classification term is constant and set to λcls = 1. We introduce a schedule for the alignment weight λalign: for all alignment methods we linearly increase λalign from 0 to 1.0 during the first 10 000 steps. For ASA we use history buffers of size 1000.

D.2 STL TO CIFAR EXPERIMENT SPECIFICATIONS

We use STL (Coates et al., 2011) and CIFAR-10 (Krizhevsky, 2009) for this adaptation task. STL and CIFAR-10 are both 10-class classification problems, with 9 classes common to the two datasets. Following Shu et al. (2018), we create a 9-class classification problem by selecting the subsets of examples belonging to the 9 common classes. For the feature extractor, we adapt the deep CNN architecture of Shu et al. (2018). The feature representation is a 192-dimensional vector. The classifier consists of a single linear layer. The discriminator is implemented by a 3-layer MLP with 512 hidden units and leaky-ReLU activations.
We train all methods for 40 000 steps with batch size 64. We train the feature extractor, the classifier, and the discriminator with Adam (Kingma & Ba, 2014) (learning rate 0.001, β1 = 0.5, β2 = 0.999, no weight decay). We perform a single discriminator update per update of the feature extractor and the classifier. The weight for the classification loss term is constant and set to λcls = 1. For all alignment methods we use a constant alignment weight λalign = 0.1. For ASA we use history buffers of size 1000.
Conditional entropy loss. Following Shu et al. (2018), we use an auxiliary conditional entropy loss on target examples for the domain adaptation methods. For a feature extractor F_θ : X → Z and a classifier C_ϕ : Z → R^K that outputs a distribution over the class labels {1, . . . , K} (so that Σ_k [C_ϕ(z)]_k = 1), the conditional entropy loss on target examples {x_i^q}_{i=1}^{N_q} is given by

L_ent = −λ_ent (1/N_q) Σ_{i=1}^{N_q} Σ_{k=1}^{K} [C_ϕ(F_θ(x_i^q))]_k log [C_ϕ(F_θ(x_i^q))]_k,  (17)

where λ_ent is the weight of the conditional entropy loss in the total training objective. This loss acts as an additional regularizer of the embeddings of the unlabeled target examples: minimization of the conditional entropy pushes target embeddings away from the classifier's decision boundary. For all domain adaptation methods we use the conditional entropy loss (17) on target examples with the weight λent = 0.1.
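For concreteness, a PyTorch-style sketch of the loss in (17) is given below; this is our own illustration (the function name and the eps stabilizer are ours), not the released implementation.

    import torch
    import torch.nn.functional as F

    def conditional_entropy_loss(target_logits: torch.Tensor,
                                 lambda_ent: float = 0.1,
                                 eps: float = 1e-8) -> torch.Tensor:
        # average prediction entropy over a batch of target examples, scaled by lambda_ent
        probs = F.softmax(target_logits, dim=1)                  # C_phi(F_theta(x)) as probabilities
        entropy = -(probs * torch.log(probs + eps)).sum(dim=1)   # per-example entropy
        return lambda_ent * entropy.mean()

    # usage: loss_ent = conditional_entropy_loss(classifier(feature_extractor(target_batch)))
    # total_loss = loss_cls + lambda_align * loss_align + loss_ent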
D.3 EXTENDED EXPERIMENTAL RESULTS ON STL TO CIFAR

Effect of conditional entropy loss. In order to quantify the improvements due to the support alignment objective and the conditional entropy objective separately, we conduct an ablation study. In addition to the results reported in Section 5, we evaluate all domain adaptation methods (except VADA, which uses the conditional entropy loss in its original implementation) on the STL → CIFAR task without the conditional entropy loss (λent = 0). The results of the ablation study are presented in Table D.1.
We observe that the effect of the auxiliary conditional entropy loss is essentially the same for all methods across all imbalance levels: with λent = 0.1 the accuracy either improves (especially the average class accuracy) or stays at roughly the same level. The relative ranking of the distribution alignment, relaxed distribution alignment, and support alignment methods is the same for both λent = 0 and λent = 0.1. These results demonstrate that the benefits of the support alignment approach and of the conditional entropy loss are orthogonal.

Table D.1: Results of the ablation experiments on the effect of the auxiliary conditional entropy loss on STL → CIFAR data. Same setup and reporting metrics as Table 1; each cell lists the median followed by the 75th and 25th percentiles (see Table D.5 for the reporting convention), and "avg"/"min" denote average and minimum target class accuracy.
Algorithm | λent | α=0.0 avg | α=0.0 min | α=1.0 avg | α=1.0 min | α=1.5 avg | α=1.5 min | α=2.0 avg | α=2.0 min
DANN | 0.0 | 74.6 75.1 74.1 | 51.5 55.0 49.9 | 68.4 69.2 67.0 | 43.2 43.7 41.2 | 65.7 65.9 62.8 | 35.5 36.2 29.6 | 62.5 64.6 60.0 | 27.5 27.5 25.7
DANN | 0.1 | 75.3 75.4 74.9 | 54.6 56.6 54.2 | 69.9 70.1 68.6 | 44.8 45.1 40.7 | 64.9 67.1 63.7 | 34.9 36.8 33.9 | 63.3 64.8 57.4 | 27.0 28.5 21.2
IWDAN | 0.0 | 70.4 70.7 70.2 | 47.2 48.0 46.8 | 68.6 68.8 68.4 | 43.6 46.3 43.2 | 66.7 67.9 66.0 | 44.7 46.2 43.3 | 63.9 66.1 62.9 | 36.5 37.3 32.7
IWDAN | 0.1 | 69.9 70.7 69.9 | 50.5 50.6 47.9 | 68.7 69.1 68.6 | 45.8 50.5 44.8 | 67.1 67.3 65.9 | 44.7 44.8 40.4 | 64.4 64.9 63.6 | 36.8 37.9 34.5
IWCDAN | 0.0 | 70.1 70.8 70.0 | 50.5 50.8 49.1 | 68.6 69.4 68.2 | 44.2 45.8 41.2 | 66.0 66.0 65.9 | 45.0 47.8 43.7 | 63.8 64.1 62.3 | 37.3 37.7 33.6
IWCDAN | 0.1 | 70.1 70.2 70.1 | 47.8 49.3 42.4 | 69.4 69.4 69.1 | 47.1 51.3 46.3 | 66.1 67.2 65.0 | 39.9 40.8 37.7 | 64.5 65.1 63.9 | 37.0 40.2 35.5
sDANN-4 | 0.0 | 69.4 70.0 68.8 | 46.5 49.7 45.1 | 69.6 69.7 69.3 | 49.1 49.2 47.4 | 68.0 68.6 67.8 | 48.2 48.8 42.6 | 66.3 66.4 64.2 | 40.7 42.9 36.6
sDANN-4 | 0.1 | 71.8 72.1 71.7 | 52.1 52.8 52.1 | 71.1 71.7 70.4 | 49.9 51.8 48.1 | 69.4 70.0 68.7 | 48.6 49.0 43.5 | 66.4 67.9 66.2 | 39.0 47.1 33.6
ASA-sq | 0.0 | 69.9 70.3 69.9 | 48.0 50.1 46.6 | 68.8 68.9 68.6 | 47.3 49.3 45.3 | 68.1 68.7 67.2 | 45.4 47.8 45.2 | 65.7 66.4 65.6 | 43.6 45.0 41.3
ASA-sq | 0.1 | 71.7 71.9 71.7 | 52.9 53.4 46.7 | 70.7 71.0 70.4 | 51.6 52.7 46.8 | 69.2 69.3 69.2 | 45.6 52.0 43.3 | 68.1 68.2 67.2 | 44.7 45.9 39.8
ASA-abs | 0.0 | 69.8 70.0 68.9 | 45.7 48.0 45.4 | 68.4 68.6 68.4 | 44.3 46.8 44.0 | 67.9 68.1 67.0 | 46.6 48.4 40.4 | 66.3 66.9 65.7 | 41.6 44.9 40.3
ASA-abs | 0.1 | 71.6 71.7 71.2 | 49.0 53.5 48.4 | 70.9 71.0 70.8 | 49.2 50.0 47.3 | 69.6 69.9 69.6 | 43.2 49.5 42.1 | 67.8 68.2 66.6 | 40.9 49.0 35.4

Comparison with optimal transport based baselines. We provide additional experimental results comparing our method with OT-based methods for domain adaptation. We implement two OT-based methods, described below.
The first method is a variant of the max-sliced Wasserstein distance (proposed for GAN training by Deshpande et al. (2019)) adapted to domain adaptation. In the table below we refer to this method as DANN-OT. In our implementation, DANN-OT minimizes the Wasserstein distance between the pushforward distributions g♯p, g♯q induced by the optimal log-loss discriminator g (4). As discussed in Section 4.2 (paragraph "Distribution alignment"), the Wasserstein distance between 1D distributions can be computed efficiently via sorting (see the sketch below).
The second method is an OT-based variant of DANN which uses a dual Wasserstein discriminator instead of the log-loss discriminator. In the table below we refer to this method as DANN-WGP. This method minimizes the Wasserstein distance in its dual Kantorovich form. We train the discriminator with the dual Wasserstein objective and the gradient penalty proposed to enforce the Lipschitz constraint (Gulrajani et al., 2017).
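A minimal PyTorch sketch of the sorting-based 1D Wasserstein computation used for DANN-OT-style alignment is given below (our illustration with assumed variable names, not the released implementation; it assumes equal-size mini-batches).

    import torch

    def wasserstein_1d(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # W1 distance between two equal-size 1D empirical distributions, computed by sorting
        a_sorted, _ = torch.sort(a)
        b_sorted, _ = torch.sort(b)
        return (a_sorted - b_sorted).abs().mean()

    # usage: d_src and d_tgt are the discriminator outputs on a source and a target
    # mini-batch of equal size; an alignment loss in the spirit of DANN-OT is then
    # wasserstein_1d(d_src, d_tgt)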
We present the evaluation results of the OT-based methods on the STL → CIFAR domain adaptation task in Table D.2. Note that the OT-based methods aim to enforce distribution alignment constraints. We observe that the OT-based methods follow the same trend as DANN: they deliver improved accuracy compared to No DA in the balanced setting, but suffer in the imbalanced settings (α > 0) due to their distribution alignment nature.

Table D.2: Results of the comparison with optimal transport based methods on STL → CIFAR data. Same setup and reporting metrics as Table 1; each cell lists the median followed by the 75th and 25th percentiles.
Algorithm | α=0.0 avg | α=0.0 min | α=1.0 avg | α=1.0 min | α=1.5 avg | α=1.5 min | α=2.0 avg | α=2.0 min
No DA | 69.9 70.0 69.8 | 49.8 50.6 45.3 | 68.8 69.3 68.3 | 47.2 48.2 45.3 | 66.8 67.2 66.4 | 46.0 47.0 45.8 | 65.8 66.7 64.8 | 43.7 44.6 41.6
DANN | 75.3 75.4 74.9 | 54.6 56.6 54.2 | 69.9 70.1 68.6 | 44.8 45.1 40.7 | 64.9 67.1 63.7 | 34.9 36.8 33.9 | 63.3 64.8 57.4 | 27.0 28.5 21.2
DANN-OT | 76.0 76.0 75.8 | 55.2 55.5 54.3 | 67.7 68.9 67.1 | 43.0 43.7 36.5 | 64.5 65.1 60.9 | 34.4 34.6 29.3 | 61.3 62.0 54.4 | 24.3 25.5 23.2
DANN-WGP | 74.8 75.1 74.7 | 53.5 54.4 53.3 | 67.7 67.9 65.3 | 38.6 41.0 34.4 | 63.3 63.4 57.1 | 27.0 32.4 26.3 | 59.0 61.8 54.3 | 21.9 22.5 18.6
sDANN-4 | 71.8 72.1 71.7 | 52.1 52.8 52.1 | 71.1 71.7 70.4 | 49.9 51.8 48.1 | 69.4 70.0 68.7 | 48.6 49.0 43.5 | 66.4 67.9 66.2 | 39.0 47.1 33.6
ASA-sq | 71.7 71.9 71.7 | 52.9 53.4 46.7 | 70.7 71.0 70.4 | 51.6 52.7 46.8 | 69.2 69.3 69.2 | 45.6 52.0 43.3 | 68.1 68.2 67.2 | 44.7 45.9 39.8
ASA-abs | 71.6 71.7 71.2 | 49.0 53.5 48.4 | 70.9 71.0 70.8 | 49.2 50.0 47.3 | 69.6 69.9 69.6 | 43.2 49.5 42.1 | 67.8 68.2 66.6 | 40.9 49.0 35.4

We would also like to comment on OT-based relaxed distribution alignment. Wu et al. (2019b) propose the method WDANN-β, which minimizes the dual form of the asymmetrically-relaxed Wasserstein distance. However, they observe that sDANN-β outperforms WDANN-β in experiments. Hence, we use sDANN-β as the relaxed distribution alignment baseline in our experiments.
Effect of alignment weight. We provide additional experimental results comparing ASA with DANN and VADA across different values of the alignment loss weight λalign on the STL → CIFAR task. The results are shown in Table D.3. DANN with a higher alignment weight λalign = 1.0 performs better in the balanced (α = 0) setting and worse in the imbalanced (α > 0) settings than with a lower weight λalign = 0.1, as the distribution alignment constraint is enforced more strictly. VADA optimizes a combination of a distribution alignment objective and a VAT (virtual adversarial training) objective (Shu et al., 2018), and we observe the same trend: with the lower alignment weight λalign = 0.01, VADA performs worse in the balanced setting and better in the imbalanced settings than with the higher weight λalign = 0.1. The weight λalign = 0.1 is a middle ground between poor performance in the imbalanced setting (λalign = 1.0) and insufficiently enforced distribution alignment (λalign = 0.01).
The role of VAT (similarly to that of the conditional entropy loss) is orthogonal to the alignment objectives. Thus, we provide an additional evaluation of the combination of support alignment and VAT (the ASA-sq + VAT entry in Table D.3) with alignment weight λalign = 1.0. The good performance of this combination shows that: (1) the current support alignment performance could be improved further by using auxiliary objectives;
(2) the support alignment based method behaves qualitatively differently from the distribution alignment based methods: its performance holds even when support alignment is enforced more strictly, whereas distribution alignment needs to loosen its constraint considerably to reduce the performance degradation in the imbalanced setting.

Table D.3: Results of the comparison of ASA with DANN and VADA across different values of the alignment loss weight λalign on STL → CIFAR data. Same setup and reporting metrics as Table 1; each cell lists the median followed by the 75th and 25th percentiles.
Algorithm | λalign | α=0.0 avg | α=0.0 min | α=1.0 avg | α=1.0 min | α=1.5 avg | α=1.5 min | α=2.0 avg | α=2.0 min
DANN | 0.01 | 72.3 72.7 72.2 | 49.5 50.8 48.8 | 70.6 71.2 69.7 | 48.9 51.2 41.5 | 68.5 68.7 67.2 | 46.1 50.0 36.2 | 65.9 66.0 64.1 | 36.7 39.4 29.9
DANN | 0.1 | 75.3 75.4 74.9 | 54.6 56.6 54.2 | 69.9 70.1 68.6 | 44.8 45.1 40.7 | 64.9 67.1 63.7 | 34.9 36.8 33.9 | 63.3 64.8 57.4 | 27.0 28.5 21.2
DANN | 1.0 | 77.2 77.3 76.8 | 58.5 59.4 56.7 | 66.3 66.8 64.5 | 37.9 41.6 37.5 | 62.8 63.3 56.1 | 27.5 28.9 24.6 | 58.7 59.7 52.3 | 18.5 20.5 17.2
VADA | 0.01 | 74.4 74.4 74.2 | 54.2 55.4 52.6 | 71.7 71.7 71.7 | 51.6 52.0 45.0 | 69.5 69.7 68.4 | 47.5 49.8 40.0 | 65.9 66.1 64.8 | 37.2 39.4 35.3
VADA | 0.1 | 76.7 76.7 76.6 | 56.9 58.3 53.5 | 70.6 71.0 70.0 | 47.7 48.8 44.0 | 66.1 66.5 65.4 | 35.7 39.3 33.3 | 63.2 64.7 60.2 | 25.5 28.0 25.2
ASA-sq | 0.1 | 71.7 71.9 71.7 | 52.9 53.4 46.7 | 70.7 71.0 70.4 | 51.6 52.7 46.8 | 69.2 69.3 69.2 | 45.6 52.0 43.3 | 68.1 68.2 67.2 | 44.7 45.9 39.8
ASA-sq + VAT | 1.0 | 74.2 74.5 74.0 | 52.2 52.5 51.9 | 72.2 72.2 71.9 | 53.5 53.6 45.4 | 70.6 70.8 70.4 | 48.9 52.3 45.6 | 67.4 67.7 66.8 | 43.0 46.0 39.4

D.4 VISDA-17 EXPERIMENT SPECIFICATIONS

We use the train and validation sets of the VisDA-17 challenge (Peng et al., 2017). For the feature extractor we use the ResNet-50 (He et al., 2016) architecture with a modified output size of the final linear layer. The feature representation is a 256-dimensional vector. We initialize all layers except the final linear layer with the weights of a pre-trained ResNet-50 model (torchvision model hub). The classifier consists of a single linear layer. The discriminator is implemented by a 3-layer MLP with 1024 hidden units and leaky-ReLU activations.
We train all methods for 50 000 steps with batch size 36. We train the feature extractor, the classifier, and the discriminator with SGD. For the feature extractor we use learning rate 0.001, momentum 0.9, and weight decay 0.001. For the classifier we use learning rate 0.01, momentum 0.9, and weight decay 0.001. For the discriminator we use learning rate 0.005, momentum 0.9, and weight decay 0.001. We perform a single discriminator update per update of the feature extractor and the classifier. We linearly anneal the feature extractor's and classifier's learning rates throughout the 50 000 training steps; by the end of training these learning rates are decreased by a factor of 0.05.
The weight for the classification term is constant and set to λcls = 1. We introduce a schedule for the alignment weight λalign: for all alignment methods we linearly increase λalign from 0 to 0.1 during the first 10 000 steps. For all methods we use the auxiliary conditional entropy loss on target examples with the weight λent = 0.1. For ASA we use history buffers of size 1000.

D.5 HISTORY SIZE EFFECT AND EVALUATION OF SUPPORT DISTANCE

History size effect. To quantify the effects of mini-batch training mentioned in Section 3, we explore different sizes of the history buffers on the USPS → MNIST task with label distribution shift α = 1.5. The results are presented in Figure D.1 and Table D.4.
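The history buffers referenced throughout these experiments can be thought of as fixed-size FIFO queues of past discriminator outputs, one per domain. The sketch below is our own simplified illustration of such a buffer (class and method names are assumptions, not the released implementation).

    import collections
    import torch

    class HistoryBuffer:
        """Fixed-size FIFO buffer of scalar discriminator outputs for one domain."""

        def __init__(self, max_size: int = 1000):
            self.values = collections.deque(maxlen=max_size)

        def push(self, batch_outputs: torch.Tensor) -> None:
            # store detached copies of the current mini-batch of discriminator outputs
            self.values.extend(batch_outputs.detach().cpu().flatten().tolist())

        def as_tensor(self) -> torch.Tensor:
            return torch.tensor(list(self.values))

    # usage: keep one buffer per domain, push the discriminator outputs after every step,
    # and use the buffered values as a larger sample of the corresponding pushforward
    # distribution when computing the 1D alignment loss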
Figure D.2 shows the distributions of the outputs of the learned discriminator at the end of training. While without any alignment objective neither the densities nor the supports of g_ψ♯p_Z^θ and g_ψ♯q_Z^θ are aligned, both alignment methods approximately satisfy their respective alignment constraints. Compared with the DANN results, ASA with a small history size behaves similarly to distribution alignment, while all history sizes are sufficient for support alignment. We also observe a correlation between the distribution distance and target accuracy: under label distribution shift, the better the distribution alignment, the more the target accuracy suffers. Note that with very large history buffers (e.g. n = 5000) we observe a sudden drop in performance and an increase in the distances. We hypothesize that this is caused by the fact that the history buffer stores discriminator output values from past steps while the discriminator parameters constantly evolve during training. As a result, for a large history buffer, the older items may no longer accurately represent the current pushforward distribution, as they become outdated.

Figure D.1: Evaluation of the history size effect for ASA on USPS → MNIST with label distribution shift (α = 1.5). The panels show (left to right): minimum class accuracy on the target test set; Wasserstein distance D_W(g_ψ♯p_Z^θ, g_ψ♯q_Z^θ) between the pushforward distributions of the source and target representations induced by the discriminator; SSD divergence D_Δ(g_ψ♯p_Z^θ, g_ψ♯q_Z^θ) between the pushforward distributions. In each panel the dashed lines show the respective quantities for the No DA and DANN methods.

Direct evaluation of support distance. In order to directly evaluate the ability of ASA (with history buffers) to enforce support alignment, we consider the setting of the illustrative experiment described in Section 5 (3-class USPS → MNIST adaptation with a 2D feature extractor, α = 1.5). We compare the methods No DA, DANN, and ASA-abs (with different history buffer sizes). For each method we consider the embedding space of the learned feature extractor at the end of training and compute the Wasserstein distance D_W(p_Z^θ, q_Z^θ) and the SSD divergence D_Δ(p_Z^θ, q_Z^θ) between the embeddings of the source and target domains (note that we compute the distances in the original embedding space directly, without projecting the data to 1D with the discriminator). To ensure a meaningful comparison of the distances between different embedding spaces, we apply a global affine transformation to each embedding space: we center the embeddings so that their average is 0 and re-scale them so that their average norm is 1 (a sketch of this evaluation is given after Table D.5). The results of this evaluation are shown in Table D.5. We observe that, compared to the no-alignment and distribution alignment (DANN) methods, ASA aligns the supports without necessarily aligning the distributions (in this imbalanced setting, distribution alignment implies low adaptation accuracy).

Table D.4: Analysis of the effect of the history size parameter for ASA on USPS → MNIST with class label distribution shift corresponding to α = 1.5. We report target accuracy, the distribution and support distances between the pushforward distributions g_ψ♯p_Z^θ and g_ψ♯q_Z^θ, and the value of the discriminator's log-loss. Each cell lists the median followed by the 75th and 25th percentiles.
Method | History size | Accuracy (avg) | Accuracy (min) | D_W(g_ψ♯p_Z^θ, g_ψ♯q_Z^θ) | D_Δ(g_ψ♯p_Z^θ, g_ψ♯q_Z^θ) | Log-loss
No DA | — | 71.28 72.51 71.25 | 27.46 37.26 24.21 | 307.56 322.00 277.33 | 40.35 46.06 32.10 | 0.05 0.07 0.04
DANN | — | 69.96 71.25 63.89 | 1.11 1.53 0.99 | 0.11 0.11 0.10 | 0.00 0.00 0.00 | 0.65 0.65 0.65
ASA-abs | 0 | 62.75 64.35 61.78 | 19.36 23.63 17.90 | 1.07 1.15 0.99 | 0.01 0.01 0.00 | 0.57 0.58 0.56
ASA-abs | 100 | 80.58 81.73 78.22 | 35.09 44.37 32.10 | 2.64 2.70 2.15 | 0.00 0.00 0.00 | 0.53 0.53 0.52
ASA-abs | 500 | 92.02 92.76 90.56 | 76.96 83.72 70.94 | 6.21 6.48 5.69 | 0.00 0.01 0.00 | 0.45 0.45 0.45
ASA-abs | 1000 | 92.54 92.93 90.90 | 82.41 85.43 74.53 | 8.06 8.19 7.97 | 0.01 0.02 0.01 | 0.41 0.41 0.40
ASA-abs | 5000 | 86.03 87.50 84.86 | 62.19 71.62 46.98 | 29.23 29.63 24.54 | 0.05 0.08 0.05 | 0.29 0.30 0.29

Figure D.2: Kernel density estimates (in the discriminator output space) of g_ψ♯p_Z^θ (source) and g_ψ♯q_Z^θ (target) at the end of training on the USPS → MNIST task with α = 1.5. n is the size of the ASA history buffers; panels (c) and (d) show ASA with n = 0 and n = 1000, respectively.

Table D.5: Results of No DA, DANN, and ASA-abs (with different history sizes) on 3-class USPS → MNIST adaptation with a 2D feature extractor and label distribution shift corresponding to α = 1.5. We report average and minimum target class accuracy, as well as the Wasserstein distance D_W and the support divergence D_Δ between the source p_Z^θ and target q_Z^θ 2D embedding distributions. Each cell lists the median followed by the 75th and 25th percentiles across 5 runs.
Algorithm | History size | Accuracy (avg) | Accuracy (min) | D_W(p_Z^θ, q_Z^θ) | D_Δ(p_Z^θ, q_Z^θ)
No DA | — | 63.0 69.6 62.3 | 45.3 53.6 37.9 | 0.78 0.84 0.75 | 0.10 0.10 0.10
DANN | — | 75.6 83.7 72.4 | 54.8 55.1 49.6 | 0.07 0.08 0.06 | 0.02 0.02 0.02
ASA-abs | 0 | 73.9 84.1 73.4 | 61.8 72.4 54.6 | 0.23 0.47 0.22 | 0.03 0.03 0.03
ASA-abs | 100 | 88.5 95.1 86.8 | 71.4 93.3 70.6 | 0.54 0.36 0.56 | 0.03 0.03 0.03
ASA-abs | 500 | 94.5 94.7 88.7 | 89.0 90.3 83.1 | 0.59 0.64 0.55 | 0.03 0.03 0.03
ASA-abs | 1000 | 91.1 93.0 91.1 | 85.6 86.2 80.7 | 0.59 0.62 0.55 | 0.03 0.03 0.03
ASA-abs | 2000 | 94.0 94.7 91.2 | 88.6 89.4 80.2 | 0.62 0.66 0.58 | 0.03 0.03 0.03
ASA-abs | 5000 | 82.1 83.9 81.8 | 68.9 70.9 65.5 | 0.64 0.67 0.63 | 0.04 0.04 0.04
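As mentioned before Table D.4, the evaluation first applies a global affine normalization to each embedding space and then measures distances between the source and target embeddings. The sketch below is our own illustration of one way to implement this evaluation, assuming the normalization statistics are computed on the union of source and target embeddings and the SSD divergence is estimated with nearest-neighbor distances; the exact estimators behind the reported numbers are not specified here.

    import numpy as np
    from scipy.spatial.distance import cdist

    def normalize_jointly(z_src: np.ndarray, z_tgt: np.ndarray):
        # one global affine transformation per embedding space: center the union of the
        # source and target embeddings at 0 and rescale so that the average norm is 1
        z = np.concatenate([z_src, z_tgt], axis=0)
        mean = z.mean(axis=0, keepdims=True)
        scale = np.linalg.norm(z - mean, axis=1).mean()
        return (z_src - mean) / scale, (z_tgt - mean) / scale

    def ssd_estimate(z_src: np.ndarray, z_tgt: np.ndarray) -> float:
        # plug-in estimate of D_Delta: mean nearest-neighbor distance from source samples
        # to target samples, plus the same quantity in the opposite direction
        d = cdist(z_src, z_tgt)
        return d.min(axis=1).mean() + d.min(axis=0).mean()

    # usage: z_src, z_tgt = normalize_jointly(source_embeddings, target_embeddings)
    #        print(ssd_estimate(z_src, z_tgt))
    # the Wasserstein distance D_W on the same normalized embeddings can be computed
    # with an off-the-shelf OT solver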