Published as a conference paper at ICLR 2022

DISCREPANCY-BASED ACTIVE LEARNING FOR DOMAIN ADAPTATION

Antoine de Mathelin1,2, François Deheeger1, Mathilde Mougeot3,2, Nicolas Vayatis2
1Michelin, 2Centre Borelli, Université Paris-Saclay, CNRS, ENS Paris-Saclay, 3ENSIIE

The goal of this paper is to design active learning strategies which lead to domain adaptation under an assumption of Lipschitz functions. Building on previous work by Mansour et al. (2009), we adapt the concept of discrepancy distance between source and target distributions to restrict the maximization over the hypothesis class to a localized class of functions which perform accurate labeling on the source domain. We derive generalization error bounds for such active learning strategies in terms of the Rademacher average and the localized discrepancy, for general loss functions satisfying a regularity condition. A practical K-medoids algorithm that can address the case of large data sets is inferred from the theoretical bounds. Our numerical experiments show that the proposed algorithm is competitive against other state-of-the-art active learning techniques in the context of domain adaptation, in particular on large data sets of around one hundred thousand images.

1 INTRODUCTION

Machine learning models trained on a labeled data set from a source domain may fail to generalize on new target domains of interest (Saenko et al., 2010). This issue, which can be caused by domain shift, can be handled when no target labels are available through unsupervised domain adaptation methods (Ganin et al., 2016). Using a small sample of labeled target data can, besides, greatly improve the model performance (Motiian et al., 2017). Acquiring such new labels is often expensive (Settles, 2010), so one seeks to query as few labels as possible. This explains why strategies of optimal label acquisition, referred to as active learning (Cohn et al., 1994), seem very promising for domain adaptation (Su et al., 2020). Active learning is a challenging task and a broad literature exists. On the one hand, some active learning methods introduce heuristic approaches which provide the benefit of using practical algorithms based on simple criteria. For instance, the spatial coverage of the target domain (Hu et al., 2010; Bodó et al., 2011) or the minimization of target uncertainties (Ray Chaudhuri & Hamey, 1995; Gal et al., 2017) are considered, as well as combinations of these heuristics (Wei et al., 2015; Kaushal et al., 2019). However, finding the proper heuristics is not straightforward and previous methods do not link their query strategy with the target risk (Viering et al., 2019). On the other hand, active learning methods based on distribution matching aim at minimizing a distribution distance between the labeled set and the unlabeled one (Balcan et al., 2009; Wang & Ye, 2015; Viering et al., 2019). These methods provide theoretical guarantees on the target risk through generalization bounds. However, the computation of the distances is either not scalable to large-scale data sets (Balcan et al., 2009; Wang & Ye, 2015; Viering et al., 2019) or based on adversarial training (Su et al., 2020; Shui et al., 2020), which involves complex hyper-parameter calibration (Kurach et al., 2019). In this work, we propose to address the issue of active learning for general loss functions under domain shift through a distribution matching approach based on discrepancy minimization (Mansour et al., 2009).
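For reference, the discrepancy distance of Mansour et al. (2009) between two distributions Q and P on X, for a hypothesis class H and loss L, is the maximal difference of average losses over the whole class; the localized version introduced in Section 2.2 restricts this maximum to a smaller set of candidates. In LaTeX notation (consistent with the definitions of Section 2.1):

```latex
\mathrm{disc}_{\mathcal{H}}(Q, P) \;=\; \max_{h, h' \in \mathcal{H}}
  \big|\, \mathcal{L}_{Q}(h, h') - \mathcal{L}_{P}(h, h') \,\big|,
\qquad \text{where } \mathcal{L}_{D}(h, h') = \mathbb{E}_{x \sim D}\big[L\big(h(x), h'(x)\big)\big].
```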
In Section 2, we derive theoretical results by adopting a localized discrepancy distance (Zhang et al., 2020) between the labeled and unlabeled empirical distributions. This localized discrepancy is defined as the plain discrepancy considered on a hypothesis space restricted to hypotheses close to the labeling function on the labeled data set. This distance has the benefit to focus only on relevant Correspondence at antoine.de-mathelin-de-papigny@michelin.com Published as a conference paper at ICLR 2022 candidates for approximating the labeling function and thus provides tighter bound of the target risk under some given assumptions (Zhang et al., 2020). Based on this distance, we provide a generalization bound of the target risk involving pairwise distances between sample points (Theorem 1). Inspired by this generalization error bound, we propose in Section 3 an accelerated K-medoids query algorithm which scales to large data set. In Section 4, we present the related works and analytically show that our proposed approach displays tighter theoretical control of the target risk than the one provided by recent active learning methods. We finally present in Section 5 the benefit of the proposed approach on several empirical regression and classification active learning problems in the context of domain adaptation. 2 DISCREPANCY BASED ACTIVE LEARNING Setup and theory develop in this section mainly focus on regression tasks. Section 4.3 presents how the algorithm derived from the theoretical results can be extended to classification tasks. 2.1 SETUP AND DEFINITIONS Given two subsets X Rp and Y Rq and d : X X R+ a distance on X, we denote the source data set S = {x1, ..., xm} X m and the target data set T = {x 1, ..., x n} X n. We consider the domain shift setting where the respective data sets S and T are drawn according to two different distributions Q and P on X. We consider a loss function L : Y Y R+ and a hypothesis space H of k-Lipschitz functions from X to Y. We denote by LD(h, h ) = Ex D[L(h(x), h (x))] the average loss (or risk) over any distribution D on X between two hypotheses h, h H. We also define the expected Rademacher complexity of H for the distribution P as: Rn(H) = E {x i}i P i=1 σih(x i) with σi drawn according to U the uniform distribution on { 1, 1}. We consider a labeling function for each distribution: f Q : X Y and f P : X Y. For adaptation to be possible, the two labeling functions are supposed to be close (Mansour et al., 2009). We finally consider the single-shot batch active learning framework (Viering et al., 2019) where all queried data are picked in one single batch of fixed budget of K > 0 queries. In this framework, an active learning algorithm takes as inputs the source data set S along with its corresponding recorded labels as well as the target data set T . The algorithm then returns a batch of K queried target data denoted TK T . The corresponding labels for TK are then recorded through an oracle and used along with the source labeled data to fit an hypothesis h H. We denote by LK = S TK the labeled data set. The goal is to select the K target data to label in order to minimize the target risk of h : LP (h, f P ). 2.2 LOCALIZED DISCREPANCY To formulate the problem of active learning under domain shift as a distribution matching problem, one needs to consider a measure of divergence between distributions. 
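Before introducing that divergence, the single-shot batch protocol of Section 2.1 can be summarized by the short sketch below. It is illustrative only: `query_strategy`, `oracle` and `fit_hypothesis` are hypothetical placeholders for the query rule, the labeling oracle and the training procedure.

```python
import numpy as np

def single_shot_batch_active_learning(S, y_S, T, K, query_strategy, oracle, fit_hypothesis):
    """Single-shot batch active learning (Section 2.1): all K queries are
    selected in one batch, labeled by the oracle, then a hypothesis is
    fitted on the labeled set L_K = S ∪ T_K."""
    idx_K = query_strategy(S, y_S, T, K)        # K indices into the target set T
    T_K, y_K = T[idx_K], oracle(T[idx_K])       # record the labels of the queried batch
    X_L = np.concatenate([S, T_K], axis=0)      # labeled inputs L_K
    y_L = np.concatenate([y_S, y_K], axis=0)
    return fit_hypothesis(X_L, y_L), idx_K
```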
Recent interest focuses on the discrepancy (Mansour et al., 2009) which proves to be useful for domain adaptation (Cortes & Mohri, 2014; Zhang et al., 2019) and is recently used on the active learning setting (Viering et al., 2019). However this metric, defined as a maximal difference between domain losses over the whole hypothesis space, is relatively conservative as it includes hypotheses that the learner might not ever consider as candidates for the labeling function (Cortes et al., 2019c; Zhang et al., 2020). Based on this consideration, we introduce a localized discrepancy (Zhang et al., 2020) to restrict the measure of divergence between domains on a set of relevant candidate hypotheses for approximating the labeling function: Definition 1. Localized Discrepancy. Let K > 0 be the number of queries and ϵ 0. Let TK be a queried batch of size K, the empirical distributions of LK = S TK and T are respectively denoted b QK and b P. Let H be a hypothesis space and L a loss function. The localized discrepancy is defined as: disc HK ϵ ( b QK, b P) = max h,h HK ϵ |L b QK(h, h ) L b P (h, h )| , (1) with HK ϵ = {h H; L(h(x), f Q(x)) ϵ x LK} Published as a conference paper at ICLR 2022 The localized space HK ϵ includes hypotheses consistent with f Q on the labeled data, i.e. hypotheses fitting the labeled data with an error below ϵ. The parameter ϵ drives the size of HK ϵ . Obviously, ϵ needs to be high enough to ensure that HK ϵ is not an empty subset. Under appropriate assumptions on ϵ, a preliminary result is an empirical target risk bound for the localized discrepancy: Proposition 1. Let K > 0 be the number of queries and H a hypothesis space. Let b P and b QK be the empirical distributions of the respective sets T and LK = S TK of respective size n and m + K. We assume that L is a symmetric, µ-Lipschitz and bounded loss function verifying the triangular inequality. We denote by M the bound of L. Let ηH be the ideal maximal error on S T : ηH = min h H max x S T [L(h(x), f Q(x)) + L(h(x), f P (x))] (2) Then, for any ϵ ηH, any hypothesis h HK ϵ and any δ > 0, the following generalization bound holds with at least probability 1-δ: LP (h, f P ) L b QK(h, f Q) + disc HK ϵ ( b QK, b P) + ηH + 2µRn(H) + M This bound share a similar form than the one derived in Cortes et al. (2019c) for the discrepancy. The ideal maximal error ηH characterizes the difficulty of the adaptation problem as well as the ability to learn the labeling functions with the considered hypothesis set H. If f P and f Q significantly differ on S T , ηH will be high and the proposed bound will lack of informativeness. We then follow the common domain adaptation assumption that the difference between the two functions is small (Mansour et al., 2009; Zhang et al., 2020). 2.3 MAIN RESULTS: GENERALIZATION BOUNDS FOR ACTIVE LEARNING Considering the previous bound (Proposition 1) it appears that a natural way of choosing the K queries in an active learning perspective is to pick the target data minimizing the localized discrepancy. Unfortunately this is a difficult problem for an arbitrary functional space H, since it leads to compute a maximum over the set space HK ϵ . Our main idea is then to further bound the localized discrepancy with a computable criterion: Theorem 1. Let K > 0 be the number of queries, H a hypothesis space of k-Lipschitz functions and ϵ ηH. Let LK = S TK be the labeled set and T the target set drawn according to P. 
We assume that L is a symmetric, µ-Lipschitz and bounded loss function verifying the triangular inequality. We define M such that L(y, y ) M for any y, y Y. For any hypothesis h HK ϵ and any δ > 0, the following generalization bound holds with at least probability 1-δ: LP (h, f P ) L b QK(h, f Q) + 2kµ x T d(x , LK) + 2ϵ + ηH + 2µRn(H) + With d(x , LK) = minx LK d(x , x) and d(x , x) the distance from x to x . Visual insights to understand Theorem 1 are presented in Figure 1 : the main idea is to approximate the maximal hypotheses h, h HK ϵ returning the localized discrepancy by the k-Lipschitz envelope of the labeling function f Q, consistent with f Q on the labeled points, i.e. at most ϵ close to f Q on these points. Indeed, for any target point x T , the gap between h, h on x : L(h (x ), h(x )) is upper bounded by twice the distance from x to its closest labeled point times the Lipschitz constants of L, h and h plus the error on the labeled point : L(h (x ), h(x )) 2kµ d(x , LK) + 2ϵ. The generalization bound of Theorem 1 highlights the trade-off that exists between the Lipschitz constant k of the hypothesis space H and the parameters ηH and ϵ. To reduce the term 2ϵ + ηH in order to obtain tighter controls over the target risk, one needs to consider a more complex set of hypothesis and thus to increase the Lipschitz constant k. Then, to benefit from the theoretical guarantees of this bound, a careful choice of hypothesis set have to be made (cf Section 4.3). The important benefit of the derived bound is to bring out the bounding criterion P x T d(x , LK) which is independent of the hypothesis complexity (characterized by k) and the loss function, it only involves pairwise distances between sample points. As this criterion is computable and depends on the queried batch TK, we can then propose an active learning strategy. Published as a conference paper at ICLR 2022 Figure 1: Visual insights of Theorem 1 : Any potential candidate for f P in HK ϵ returns values between h and h : two hypotheses ϵ close to f Q on the labeled points and with slopes of factor k everywhere. Thus, an approximation of the localized discrepancy is given by the mean of gaps between h and h (mean length of orange arrows) which can be approximated by the distance to the labeled set in X times k. The best target point to label is chosen in order to minimize this sum. 3 DISCREPANCY-BASED ALGORITHM Theorem 1 directly implies that selecting the K queries minimizing P x T d(x , LK) leads to minimize an upper bound of the target risk. We can thus propose an algorithm dedicated to active learning which provides theoretical guarantees on the target risk. Seeking the K queries minimizing P x T d(x , LK) corresponds to solve a K-medoids problem (Kaufmann & Rousseeuw, 1987). Notice however that it does not consist of a K-medoids performed directly on the target domain as the source data are already labeled and considered as medoids. The presented algorithm only differs in the initialization process, where the distance between each target and its nearest source neighbour needs to be computed. Several algorithms exist to solve or approximate the K-medoids (Kaufman & Rousseeuw, 2009), (Ng & Han, 2002), (Park & Jun, 2009). Here we use the greedy version of the algorithm. It can be shown that the gain of the greedy algorithm (the amount of decrease of the criterion by selecting K points) is at least a (1 1/e)-approximation of the optimal gain (cf Appendix). 
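One way to implement this greedy selection is sketched below with NumPy/SciPy, assuming Euclidean distances: the source points act as fixed medoids and each step adds the target point that most decreases the criterion, i.e. the sum over target points of the distance to the closest labeled point. This is an illustrative sketch, not the authors' accelerated implementation (Algorithm 1 below).

```python
import numpy as np
from scipy.spatial.distance import cdist

def greedy_kmedoids_query(S, T, K):
    """Greedily select K target indices minimizing sum_{x' in T} d(x', S ∪ T_K).

    S : (m, p) source points (already labeled, kept as fixed medoids).
    T : (n, p) unlabeled target points.
    """
    d_TT = cdist(T, T)                      # pairwise target-target distances
    d_min = cdist(T, S).min(axis=1)         # current d(x', L_K), initially d(x', S)
    chosen = np.zeros(len(T), dtype=bool)
    for _ in range(K):
        # Criterion value obtained if candidate j were added to the labeled set:
        # sum_i min(d_min[i], d(x'_i, x'_j)).
        gains = np.minimum(d_min[None, :], d_TT).sum(axis=1)
        gains[chosen] = np.inf              # never re-select an already chosen point
        j = int(np.argmin(gains))
        chosen[j] = True
        d_min = np.minimum(d_min, d_TT[j])  # update distances to the labeled set
    return np.where(chosen)[0]
```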
It is well known that K-medoids algorithms suffer from computational burdens or memory issues on large and moderately large data sets ( 100K data) (Newling & Fleuret, 2017). Indeed they require to compute huge pairwise distance matrix between the source and target samples as well as between targets. Precisely, the greedy K-medoids algorithm presents a complexity of O(p(nm + n2) + Kn2) and a memory usage of O(nm + n2) with m, n, K the respective size of S, T and TK. p is the dimension of X. To handle this issue, we propose an adaptation of the K-medoids greedy algorithm with better scalability (Algorithm 1). This algorithm performs the following steps: 1) Computation of the distance to the closest source for each target points via a KD-trees random forest algorithm (Silpa-Anan & Hartley, 2008) of complexity O(T(m + pn) log(m)2) with T the number of trees. 2) Medoids initialization using the greedy algorithm on a random target batch of B samples with complexity O((K + p)B2). 3) An iterative algorithm on the model of (Park & Jun, 2009), combining assignment of each target point to its closest medoid (O(Kpn)) and medoid update inside each cluster. 4) The medoid update for each cluster is done through an original Branch-and-Bound algorithm (Land & Doig, 2010) which estimates the criterion by iteration over mini-batch, target points for which criterion is bigger than a statistical threshold are left aside. Thus, the number of pairwise Published as a conference paper at ICLR 2022 distances to compute is reduced at each iteration until the maximal iteration number is reached or all target points are left aside. Under some assumptions, we can show that the complexity of the update for all cluster is O(n3/2K 1/2) (the proof is given in the supplementary material). The overall complexity is then O(T(m + pn) log(m)2 + (K + p)B2 + Kpn + pn3/2K 1/2) which provides reasonable computational time for moderately large data set (n, m 105 and p 103). Empirical comparison of computational time is also provided in the supplementary material. Algorithm 1 Accelerated K-medoids 1: {d(x , S)}x T KDT-Forest-Nearest-Neighbour(S, T ) # Compute distances to source 2: TK = {x i }i=1..K Kmedois-Greedy(T , B, K, {d(x , S)}x T ) # Initialize medoids 3: Ci {x T | d(x , x i ) = min j=1..Kd(x , x j) d(x , S)} # Assign targets to closest cluster 4: while any medoid x i can be updated do 5: for i from 1 to K do 6: x i = argmin x Ci x Ci d(x , x ) Branch-and-Bound(Ci) # Update medoid 7: end for 8: For all x T , compute d(x , x i ) for all updated medoids and reassign x to the closest Ci 9: end while 4 RELATED WORK AND DISCUSSION 4.1 RELATED WORK Active learning as distribution matching. Active learning methods based on distribution matching aim at reducing the gap between the distributions of the labeled sample and the unlabeled one with a minimal query budget. Several metrics are used to measure the gap between distributions as the Transductive Rademacher Complexity (Gu & Han, 2012), the MMD (Wang & Ye, 2015; Kim et al., 2016; Viering et al., 2019) the Disagreement Coefficient (Hanneke, 2007; Balcan et al., 2009; Beygelzimer et al., 2009; Cortes et al., 2019a;b; 2020), the H-divergence (Sinha et al., 2019; Gissin & Shalev-Shwartz, 2019; Su et al., 2020) or the Wasserstein distance (Shui et al., 2020). To the best of our knowledge, only one paper deals with the discrepancy for active learning (Viering et al., 2019). 
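Referring back to Algorithm 1 above, the sketch below reproduces its overall structure under simplifying assumptions: a single scikit-learn KD-tree stands in for the KD-tree random forest of step 1, and the branch-and-bound update of step 4 is replaced by an exhaustive within-cluster update. It is illustrative and not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import KDTree

def _greedy_init(T_batch, d_source_batch, K):
    """Greedy K-medoids initialization on a random target batch (step 2)."""
    d_TT = cdist(T_batch, T_batch)
    d_min = d_source_batch.copy()
    chosen = np.zeros(len(T_batch), dtype=bool)
    for _ in range(K):
        gains = np.minimum(d_min[None, :], d_TT).sum(axis=1)
        gains[chosen] = np.inf
        j = int(np.argmin(gains))
        chosen[j] = True
        d_min = np.minimum(d_min, d_TT[j])
    return np.where(chosen)[0]

def accelerated_kmedoids(S, T, K, B=5000, n_iter=20, seed=0):
    """Simplified sketch of Algorithm 1 (single KD-tree, exhaustive medoid update)."""
    rng = np.random.default_rng(seed)

    # 1) Distance of every target point to its closest source point.
    d_source = KDTree(S).query(T, k=1)[0].ravel()

    # 2) Medoid initialization on a random batch of B target points.
    batch = rng.choice(len(T), size=min(B, len(T)), replace=False)
    medoids = batch[_greedy_init(T[batch], d_source[batch], K)]

    for _ in range(n_iter):
        # 3) Assign each target to its closest medoid, unless the closest
        #    source point is nearer (the point then stays covered by S).
        d_med = cdist(T, T[medoids])
        assign = d_med.argmin(axis=1)
        in_cluster = d_med.min(axis=1) <= d_source

        # 4) Update each medoid inside its cluster (exhaustive search here,
        #    instead of the Branch-and-Bound update of Algorithm 3).
        new_medoids = medoids.copy()
        for i in range(K):
            members = np.where((assign == i) & in_cluster)[0]
            if len(members) > 1:
                within = cdist(T[members], T[members]).sum(axis=1)
                new_medoids[i] = members[int(np.argmin(within))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids
```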
The authors consider the discrepancy on the space of RKHS hypotheses with PSD kernels and provide an explicit way of computing the discrepancy using eigen-value analysis. However, the corresponding algorithm encounters computational burden and could hardly be applied on large sets. K-medoids for active learning. Many active learning methods use a K-medoids algorithm as an heuristic measure of representativeness (Lin et al., 2009; Gomes & Krause, 2010; Wei et al., 2013; Iyer & Bilmes, 2013; Zheng et al., 2014). The K-medoids is in general computed on a smaller set of selected targets with the higher uncertainties (Wei et al., 2015; Kaushal et al., 2019). In this present work, we provide theoretical insights for this algorithm by highlighting the link with discrepancy minimization. Active learning for domain adaptation. Our work is related to the recent advances on active learning for domain adaptation as we also consider the domain shift hypothesis (Rai et al., 2010; Saha et al., 2011; Deng et al., 2018; Su et al., 2020). These works use in general the output of a domain classifier to measure the informativeness of target samples. In our work, we consider instead the distance to the source sample to capture informative target data. Lipschitz consistent functions for active learning. In the context of function optimization, some methods consider the set of Lipschitz or locally Lipschitz functions consistent with the observations (Valko et al., 2013; Grill et al., 2015; Malherbe & Vayatis, 2017). We use similar functions to approximate the maximal hypotheses returning the localized discrepancy. Notice that the goal of the previous papers differ from ours as they aim at finding the maximum of the labeling function. 4.2 COMPARISON WITH EXISTING GENERALIZATION BOUNDS FOR ACTIVE LEARNING Some previous works on active learning also propose theoretical bound on the target risk. In this section, we will compare them with our derived bound of equation (4) in order to relate our Published as a conference paper at ICLR 2022 contribution with existing works. As their framework is the pure active learning setting which differs from the setting consider here, we make the comparisons under the simplifying assumptions: f = f P = f Q, f H and ϵ = 0. In this case, our bound is written: LP (h, f) (2kµ/n) X x T d(x , LK) + 2µRn(H) + p M 2 log(1/δ)/2n (5) The paper by Sener & Savarese (2018) proposes the K-centers algorithm for active learning based on an easily computable criterion offering theoretical guarantees. For a regression loss L and under the previous assumptions, it controls the target risk as follows: LP (h, f) 2kµ max x T d(x , LK) + 2µRn(H) + O p M 2 log(1/δ)/2n (6) We directly observe that this bound is looser than the one of equation (5). Indeed, in our case the target risk is controlled with the mean of distances between unlabeled points and the labeled set whereas K-centers considers the maximum of these distances. Notice however that both algorithms use greedy selection approximation. Thus, in some cases, the queries from K-centers may lead to a smaller bound than the ones from K-medoids. A generalization bound for active learning involving the Wasserstein distance W1 has also been proposed (Shui et al., 2020). An adaptation of this bound for the proposed scenario with the aforementioned assumptions could be written as follows: LP (h, f) 2kµW1( b QK, b P) + 2µRn(H) + O p M 2 log(1/δ)/2n Where W1( b QK, b P) = arg minγ Γ P x b P γxx d(x, x ) with Γ = {γ Rn (m+K) ; γ1 = 1 n1 ; γT 1 = 1 m+K 1}. 
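The comparison concluded in the next sentence can also be checked numerically: every column of the coupling γ sums to 1/n, so the transport cost can never fall below the mean nearest-labeled-point distance. The sketch below uses the POT library (`ot.emd2`) for the exact W1 cost; POT and the random data are assumptions for illustration only.

```python
import numpy as np
import ot                                   # POT: Python Optimal Transport (assumed available)
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
T = rng.normal(size=(400, 5))               # stand-in unlabeled target sample
L_K = rng.normal(size=(60, 5))              # stand-in labeled set S ∪ T_K

# Criterion of equation (5): mean distance from target points to the labeled set.
crit = cdist(T, L_K).min(axis=1).mean()

# Exact 1-Wasserstein cost between the two empirical (uniform) distributions.
a = np.full(len(L_K), 1.0 / len(L_K))
b = np.full(len(T), 1.0 / len(T))
W1 = ot.emd2(a, b, cdist(L_K, T))

assert crit <= W1 + 1e-9                    # the criterion of (5) is never larger
```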
Thus, in observing that d(x, x ) d(x , LK) for any x T and x LK we can show that our presented bound of equation (5) is tighter than the one above. 4.3 DISCUSSION ABOUT THE ASSUMPTIONS AND LIMITATIONS One crucial point of the present work is the setting of the ϵ parameter of the hypothesis set HK ϵ . In practice, the parameter ϵ is determined by the hypothesis set and the training algorithm that the learner considers. For instance, if the learner uses over-parameterized hypotheses overfitted on the labeled data set, the parameter ϵ will be small because for any h, h(x) f Q(x) on LK. This could leads to ϵ < ηH and the bound would not be valid anymore. This highlights the trade-off between fitting the source data and generalizing to the target domain. As, in practice, the parameter ηH is hard to estimate, choosing larger ϵ (by regularizing the hypotheses) is safer, but will lead to larger bounds. To deal with this difficulty, we use, in our experiments, a set of neural networks H regularized through weight clipping. The value of the clipping parameter is directly linked to the Lipschitz constant of H. By selecting an adequate clipping parameter and network architecture, we ensure that H is complex enough to learn the task on the source domain but sufficiently regularized to avoid over-fitting (and thus avoid ϵ < ηH). For this purpose, we select the architecture and the clipping parameter through validation on the source labeled data. Regularity assumptions on the loss function L are essentially verified by norms as the Lp which are common losses for regression problems. However, they are are not verified by most classification losses. In fact, classification loss as the cross-entropy is bounded between 0 and 1 and can not increase linearly with the distance to the closest labeled point. In this context, considering target points far away from sources as informative points is not efficient. In fact, the most interesting points are the ones in the margin between classes (Balcan et al., 2007). Thus, in order to focus the K-medoids selection in the margin, and thus extent the proposed algorithm to classification task, we propose an improved version of our algorithm, the Weighted K-medoids (K-medoids+W). This algorithm performs the K-medoids algorithm with a weighted criterion. To compute the weights, we consider the Best-vs-Second-Best (BVSB) criterion (Joshi et al., 2009) which is the difference between the probabilities of the best class and the second best class, given by a hypothesis pre-trained on the source data set. The Weighted K-medoids optimization can be written as follows: x T bvsb(x ) min x S TK d(x, x ) (7) Published as a conference paper at ICLR 2022 5 EXPERIMENTS We choose to compare the performances of our algorithm to classical active learning methods on regression and classification problems in a domain shift context. We consider the single-batch active learning setting (Viering et al., 2019) where all queries are taken at the same time in one batch. We compare the results obtained on the target domain for different query and training methods. The experiments have been run on a (2.7GHz, 16G RAM) computer. The source code is provided on Git Hub 1. We use the open source code of the corresponding authors for BADGE (Ash et al., 2019) and the implementations from ADAPT2 (de Mathelin et al., 2021) for the domain adaptation methods. 
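Referring back to the weighted criterion of equation (7), the sketch below adapts the greedy selection of Section 3 to a weighted objective. The weighting is an assumption: here the weight is one minus the best-versus-second-best probability gap of a source-pretrained classifier, so that points lying in the class margin receive large weights; the authors' exact weighting may differ.

```python
import numpy as np
from scipy.spatial.distance import cdist

def bvsb_weights(proba):
    """Margin-based weights from class probabilities of a source-pretrained
    model (shape (n, n_classes)); large in the margin, small when confident.
    The 1 - (p_best - p_second) form is a hypothetical choice."""
    top2 = np.sort(proba, axis=1)[:, -2:]
    return 1.0 - (top2[:, 1] - top2[:, 0])

def weighted_kmedoids_query(S, T, K, weights):
    """Greedy selection minimizing sum_i w_i * min_{x in S ∪ T_K} d(x'_i, x)."""
    d_TT = cdist(T, T)
    d_min = cdist(T, S).min(axis=1)
    chosen = np.zeros(len(T), dtype=bool)
    for _ in range(K):
        gains = (weights[None, :] * np.minimum(d_min[None, :], d_TT)).sum(axis=1)
        gains[chosen] = np.inf
        j = int(np.argmin(gains))
        chosen[j] = True
        d_min = np.minimum(d_min, d_TT[j])
    return np.where(chosen)[0]
```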
5.1 COMPETITORS We compare the proposed approach with the following query methods : Random Sampling; Kmeans (Hu et al., 2010); K-centers (Sener & Savarese, 2018); Diversity (Jain & Grauman, 2016); QBC (Ray Chaudhuri & Hamey, 1995); BVSB (Joshi et al., 2009); AADA (Su et al., 2020) : an hybrid active learning method for domain adaptation using a combination of entropy measure from a classifier and the outputs of a domain discriminator; BADGE (Ash et al., 2019): an hybrid deep active learning method optimizing for both uncertainty and diversity. CLUE (Prabhu et al., 2020) an active domain adaptation strategy that select instances that are both uncertain and diverse. We select four different training methods Uniform Weighting; Balanced Weighting : Assign balanced total weight between source and target instances; Tr Ada Boost (Pardoe & Stone, 2010) : Transfer learning regression method based on a reverse boosting principle; Adversarial training : unsupervised features transformation on the model of DANN (Ganin et al., 2016). To make a fair comparison between the different query strategies, we use for all experiments, the same set of training hypotheses H. We define H as the set of neural networks composed of two fully connected hidden layers of 100 neurons, with Re LU activations and projection constraints on the layer norms (< 1.). We use the Adam optimizer (Kingma & Ba, 2015). The network architecture is defined to be complex enough that the network provides a good approximation of the labeling function on the source domain in all experiments. Besides, for each experiment, fine-tuning of the optimization hyper-parameters (epochs, batch sizes...) is performed using only source labeled data. We assume that the architecture and the resulting hyper-parameters will still be appropriate after adding the queried target data to the training set (see Section 4.3). Finally, for distance-based algorithm, we consider the L1 distance computed in the penultimate layer of a network pre-trained on sources. We use an ensemble of 10 models in QBC and the greedy version of K-centers. 5.2 SUPERCONDUCTIVITY DATA SET As there is very few public data sets for domain adaptation with regression tasks (Teshima et al., 2020), we choose an UCI data set with a reasonable amount of instances and split it in different domains using the setup of (Pardoe & Stone, 2010). We choose Superconductivity (Hamidieh, 2018; Dua & Graff, 2017) which is composed of features extracted from superconductors chemical formula. The task consists in predicting their critical temperature. Experimental setup: The data set is divided in four domains following (Pardoe & Stone, 2010) : low (l), middle-low (ml), middle-high (mh) and high (h) of around 4000 instances and 166 features. We use a learning rate of 0.001, a number of epochs of 100, a batch size of 128 and the mean squared error as loss function. We conduct an experiment for the 12 pairs of domains. We vary K from 5 to 300 and repeated each experiment 8 times with different random seeds. We report the mean absolute error (MAE) on the target unlabeled data for all experiments when K = 20 in Table 1. We present the MAE evolution for the adaptation from mh to h in Figure 2. Results: We observe on Figure 2 that, for any K > 0, the K-medoids algorithm presents the lowest MAE on the target data for the three different training methods. 
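Referring back to the hypothesis set of Section 5.1, the sketch below builds a comparable regularized Keras network. It is an illustration only: `MaxNorm` is one possible way to constrain the layer weight norms below 1 (the exact projection used by the authors may differ), and the 166-dimensional input matches the Superconductivity features.

```python
import tensorflow as tf
from tensorflow.keras import layers, constraints

def build_hypothesis(input_dim, output_dim, clip=1.0):
    """Two hidden layers of 100 ReLU units with norm-constrained weights:
    bounding the layer norms bounds the Lipschitz constant of the network
    (see Section 4.3)."""
    return tf.keras.Sequential([
        layers.Dense(100, activation="relu", input_shape=(input_dim,),
                     kernel_constraint=constraints.MaxNorm(clip)),
        layers.Dense(100, activation="relu",
                     kernel_constraint=constraints.MaxNorm(clip)),
        layers.Dense(output_dim, kernel_constraint=constraints.MaxNorm(clip)),
    ])

model = build_hypothesis(input_dim=166, output_dim=1)    # Superconductivity regression
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
```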
In particular we observe a significant performance gain of K-medoids when using Tr Ada Boost which provides the lowest MAE for the majority of fixed budget K compared to other training methods. These observations are confirmed on Table 1 where we observe that K-medoids presents the lowest MAE in 11 experiments over 12. We 1https://github.com/antoinedemathelin/dbal 2https://github.com/adapt-python/adapt Published as a conference paper at ICLR 2022 Table 1: MAE on the critical temperature for the Superconductivity experiments with Balanced Weighting and K = 20. Standard deviation are given in the supplementary material. Experiment l ml l mh l h ml l ml mh ml h mh l mh ml mh h h l h ml h mh Random 15.33 15.80 17.45 16.53 11.39 14.70 17.65 12.83 10.36 18.75 14.86 10.54 Kmeans 14.43 13.60 13.98 15.79 10.19 12.73 17.18 12.67 10.02 22.10 14.69 9.76 QBC 20.00 19.03 20.08 15.89 12.24 15.31 20.78 12.87 10.19 31.88 18.86 10.65 Kcenters 19.21 15.73 16.85 15.75 11.62 13.44 22.17 12.74 10.24 36.50 19.60 10.39 Diversity 19.46 18.21 18.68 16.01 11.94 15.36 23.92 14.31 10.70 37.97 20.89 10.78 Kmedoids 12.70 13.57 14.11 14.49 10.02 12.52 15.36 12.37 9.79 16.62 14.14 9.32 also observe here that methods based on spatial consideration as K-medoids, K-means and K-centers select more informative target points than the uncertainty based method QBC. This comes from the fact that, in batch mode, QBC is selecting close target points with similar uncertainty level. Finally, K-medoids outperforms K-means because it takes into account the distance to source points and then queries less redundant information. Figure 2: Results for the Superconductivity data set (mh h experiment). Evolution of the MAE in function of the budget K for three different training methods and six query methods. 5.3 OFFICE DATA SET The office data set (Saenko et al., 2010) consists in pictures of office items coming from different domains: amazon or webcam. The task is a multi-classification problem with 31 classes (chairs, printers, ...). The goal is to use data from the amazon domain where labels are easily available to learn a good model on the webcam domain where a few labels are chosen using active learning methods. Experimental setup: We consider the adaptation from amazon with 2817 labeled images to webcam with 795 unlabeled images. We use, as input features, the outputs of the Res Net-50 network (He et al., 2016) pretrained on Image Net. We vary K from 5 to 300, repeating each experiment 8 times. The learning rate is 0.001, the number of epochs 60 and the batch size 128. Results: Figure 3.A presents the results obtained. We observe that the K-medoids+W algorithm provides the best performances for almost any K and in particular for small values of K. We then present the visualization of the two first components of the PCA transform on Figure 4. We observe that the K-medoids+W algorithm queries points at the center of the target distribution but at a reasonable distance from the sources. The Random and K-means algorithms select a representative subset of the target distribution but without taking into account the sources and therefore query redundant information. K-centers selects data far from the source domain but which are less representative of the the target distribution. 5.4 DIGITS DATA SET We consider the experiment proposed in (Ganin et al., 2016) where a synthetic digits data set: SYNTH is used to learn a classification task for a data set of real digits pictures: SVHN (Street-View House Number) (Netzer et al., 2011). 
Both data sets are composed of around 65k images of size 28 28. Published as a conference paper at ICLR 2022 Figure 3: Office and digits results. Evolution of the accuracy in function of the budget K. Figure 4: Visualization of the two PCA first components of the Office data set input space. Queries are reported with black points for each query method with K = 20. Experimental setup: To handle the large number of data, we use the accelerated K-medoids algorithm (Algorithm 1) with T = 50 trees and a initial batch size of B = 5000. We use the KD-trees random forest nearest neighbour algorithm in Diversity and K-centers to approximate the distance of each target data to the source data set. We consider two kinds of input features: the ones obtained with the convulational part of a Lenet (Le Cun et al., 1998) trained with the source labeled data and the ones coming from the same network but trained with adversarial training following the model of DANN (Ganin et al., 2016). In both cases, the network is pre-trained on 30 epochs with a batch size of 128 and a learning rate of 0.001, for the adversarial training the trade-off parameter λ is set to 0.1 following the setup of (Su et al., 2020). After the query process, a Balance Weighting training is performed with the source and target labeled data using the same optimization hyper-parameters than before. Experiments are conducted 8 times for K between 10 and 500. Results: Figure 3.B and 3.C correspond to the evolution of accuracy with respect to K for the experiments conducted with the features obtained respectively without and with adversarial training. We observe that, for any K > 0, K-medoids+W provides improved results over other query strategies in both cases. This highlights the ability of the method to select informative target points in a variety of scenarios. 6 CONCLUSION AND FUTURE WORK This work introduces a novel active learning approach based on a localized discrepancy between the labeled and unlabeled distributions. We provide both theoretical guarantees of this approach and an active learning algorithm scaling to large data sets. Several experiments show very competitive results of the proposed approach. Future work will focus on considering a more appropriate distance on the input space, giving more importance to relevant features with respect to the task. Published as a conference paper at ICLR 2022 REPRODUCIBILITY STATEMENT To help the reproducibility of the results presented in this work, the source code of the experiments is available at https://github.com/antoinedemathelin/dbal. Furthermore, the presented methods and the majority of the competitors have been implemented with pythonic objects which implement a fit and predict methods. Thus, the code can easily be used on other data sets than the ones considered in this work. Finally, notebooks are provided in the repository to enable a rapid access to the methods and the possibility to try for different hyper-parameters. ACKNOWLEDGMENTS Part of this research was funded by the Manufacture Franc aise des Pneumatiques Michelin and the Industrial Data Analytics and Machine Learning chair of Centre Borelli from ENS Paris Saclay. Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations, 2019. Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. 
In International Conference on Computational Learning Theory, pp. 35 50. Springer, 2007. Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78 89, 2009. Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance Weighted Active Learning, pp. 49 56. Association for Computing Machinery, 2009. Zal an Bod o, Zsolt Minier, and Lehel Csat o. Active learning with clustering. In Active Learning and Experimental Design Workshop in Conjunction with AISTATS 2010, pp. 127 139. JMLR Workshop and Conference Proceedings, 2011. David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine learning, 15(2):201 221, 1994. Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, 2014. Corinna Cortes, Giulia De Salvo, Claudio Gentile, Mehryar Mohri, and Ningshan Zhang. Regionbased active learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2801 2809, 2019a. Corinna Cortes, Giulia De Salvo, Mehryar Mohri, Ningshan Zhang, and Claudio Gentile. Active learning with disagreement graphs. In International Conference on Machine Learning, pp. 1379 1387, 2019b. Corinna Cortes, Mehryar Mohri, and Andr es Mu noz Medina. Adaptation based on generalized discrepancy. J. Mach. Learn. Res., 20(1):1 30, January 2019c. ISSN 1532-4435. Corinna Cortes, Giulia Desalvo, Claudio Gentile, Mehryar Mohri, and Ningshan Zhang. Adaptive region-based active learning. In Hal Daum e III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2144 2153. PMLR, 13 18 Jul 2020. Antoine de Mathelin, Franc ois Deheeger, Guillaume Richard, Mathilde Mougeot, and Nicolas Vayatis. Adapt: Awesome domain adaptation python toolbox. ar Xiv preprint ar Xiv:2107.03049, 2021. Cheng Deng, Xianglong Liu, Chao Li, and Dacheng Tao. Active multi-kernel domain adaptation for hyperspectral image classification. Pattern Recognition, 77:306 315, 2018. Published as a conference paper at ICLR 2022 Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive. ics.uci.edu/ml. Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1183 1192, International Convention Centre, Sydney, Australia, 06 11 Aug 2017. PMLR. Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Franc ois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17(1):2096 2030, January 2016. ISSN 1532-4435. Daniel Gissin and Shai Shalev-Shwartz. Discriminative active learning. ar Xiv preprint ar Xiv:1907.06347, 2019. Ryan Gomes and Andreas Krause. Budgeted nonparametric learning from data streams. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML 10, pp. 391 398, Madison, WI, USA, 2010. Omnipress. Jean-Bastien Grill, Michal Valko, and R emi Munos. Black-box optimization of noisy functions with unknown smoothness. In Neural Information Processing Systems, 2015. Q. Gu and J. Han. Towards active learning on graphs: An error bound minimization approach. 
In 2012 IEEE 12th International Conference on Data Mining, pp. 882 887, 2012. Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154:346 354, 2018. URL https://archive.ics. uci.edu/ml/datasets/Superconductivty+Data#. Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th international conference on Machine learning, pp. 353 360, 2007. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pp. 409 426. Springer, 1994. Rong Hu, Brian Mac Namee, and Sarah Jane Delany. Off to a good start: Using clustering to select the initial training set in active learning. In FLAIRS Conference, 2010. Rishabh K Iyer and Jeff A Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In Advances in Neural Information Processing Systems, pp. 2436 2444, 2013. S. D. Jain and K. Grauman. Active image segmentation propagation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2864 2873, 2016. A. J. Joshi, F. Porikli, and N. Papanikolopoulos. Multi-class active learning for image classification. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2372 2379, 2009. Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons, 2009. Leonard Kaufmann and Peter Rousseeuw. Clustering by means of medoids. Data Analysis based on the L1-Norm and Related Methods, pp. 405 416, 01 1987. V. Kaushal, R. Iyer, S. Kothawade, R. Mahadev, K. Doctor, and G. Ramakrishnan. Learning from less data: A unified data subset selection and active learning framework for computer vision. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1289 1299, 2019. Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! criticism for interpretability. Advances in neural information processing systems, 29, 2016. Published as a conference paper at ICLR 2022 Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann Le Cun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. Karol Kurach, Mario Luˇci c, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. A large-scale study on regularization and normalization in gans. In International Conference on Machine Learning, pp. 3581 3590. PMLR, 2019. Ailsa H Land and Alison G Doig. An automatic method for solving discrete programming problems. In 50 Years of Integer Programming 1958-2008, pp. 105 132. Springer, 2010. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Hui Lin, Jeff Bilmes, and Shasha Xie. Graph-based submodular selection for extractive summarization. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 381 386. IEEE, 2009. C edric Malherbe and Nicolas Vayatis. Global optimization of lipschitz functions. In International Conference on Machine Learning, pp. 2314 2323. PMLR, 2017. 
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009. Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. ar Xiv preprint ar Xiv:0907.3740, 2009. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018. Saeid Motiian, Marco Piccirilli, Donald A Adjeroh, and Gianfranco Doretto. Unified deep supervised domain adaptation and generalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5715 5725, 2017. George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions i. Mathematical programming, 14(1):265 294, 1978. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. James Newling and Franc ois Fleuret. A sub-quadratic exact medoid algorithm. In Artificial Intelligence and Statistics, pp. 185 193. PMLR, 2017. Raymond T. Ng and Jiawei Han. Clarans: A method for clustering objects for spatial data mining. IEEE transactions on knowledge and data engineering, 14(5):1003 1016, 2002. David Pardoe and Peter Stone. Boosting for regression transfer. In Proceedings of the 27th International Conference on Machine Learning (ICML), June 2010. Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for k-medoids clustering. Expert systems with applications, 36(2):3336 3341, 2009. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Viraj Prabhu, Arjun Chandrasekaran, Kate Saenko, and Judy Hoffman. Active domain adaptation via clustering uncertainty-weighted embeddings. ar Xiv preprint ar Xiv:2010.08666, 2020. Piyush Rai, Avishek Saha, Hal Daum e III, and Suresh Venkatasubramanian. Domain adaptation meets active learning. In Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing, pp. 27 32, 2010. Published as a conference paper at ICLR 2022 T. Ray Chaudhuri and L. G. C. Hamey. Minimisation of data collection by active learning. In Proceedings of ICNN 95 - International Conference on Neural Networks, volume 3, pp. 1338 1341 vol.3, 1995. Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV 10, pp. 213 226, Berlin, Heidelberg, 2010. Springer-Verlag. Avishek Saha, Piyush Rai, Hal Daum e, Suresh Venkatasubramanian, and Scott L Du Vall. Active supervised domain adaptation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 97 112. Springer, 2011. Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018. Burr Settles. Active learning literature survey. University of Wisconsin, Madison, 52, 07 2010. Changjian Shui, Fan Zhou, Christian Gagn e, and Boyu Wang. Deep active learning: Unified and principled method for query and training. In International Conference on Artificial Intelligence and Statistics, pp. 
1308 1318. PMLR, 2020. Chanop Silpa-Anan and Richard Hartley. Optimised kd-trees for fast image descriptor matching. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1 8. IEEE, 2008. Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5972 5981, 2019. J. Su, Y. Tsai, K. Sohn, B. Liu, S. Maji, and M. Chandraker. Active adversarial domain adaptation. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 728 737, 2020. Takeshi Teshima, Issei Sato, and Masashi Sugiyama. Few-shot domain adaptation by causal mechanism transfer. In International Conference on Machine Learning, pp. 9458 9469. PMLR, 2020. Michal Valko, Alexandra Carpentier, and R emi Munos. Stochastic simultaneous optimistic optimization. In International Conference on Machine Learning, pp. 19 27. PMLR, 2013. Tom Viering, Jesse Krijthe, and Marco Loog. Nuclear discrepancy for single-shot batch active learning. Machine Learning, 06 2019. Zheng Wang and Jieping Ye. Querying discriminative and representative samples for batch mode active learning. ACM Trans. Knowl. Discov. Data, 9(3), February 2015. Kai Wei, Yuzong Liu, Katrin Kirchhoff, and Jeff Bilmes. Using document summarization techniques for speech data subset selection. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 721 726, Atlanta, Georgia, June 2013. Association for Computational Linguistics. Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In Francis Bach and David Blei (eds.), International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1954 1963, Lille, France, 07 09 Jul 2015. PMLR. Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7404 7413, Long Beach, California, USA, 09 15 Jun 2019. PMLR. Yuchen Zhang, Mingsheng Long, Jianmin Wang, and Michael I Jordan. On localized discrepancy for domain adaptation. ar Xiv preprint ar Xiv:2008.06242, 2020. Jingjing Zheng, Zhuolin Jiang, Rama Chellappa, and Jonathon P Phillips. Submodular attribute selection for action recognition in video. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 1341 1349. Curran Associates, Inc., 2014. Published as a conference paper at ICLR 2022 We recall here the notations and definitions used in the following: X Rp and Y R are the respective input and output subsets. d : X X R+ is a distance on X. P and Q are two distributions on X. T = {x 1, ..., x n} X n is the unlabeled target data set and S = {x1, ..., xm} X m the labeled source data set drawn respectively from P and Q. L : Y Y R+ is a loss function. H is a hypothesis set of k-Lipschitz functions from X to Y. Ex D[L(h(x), h (x))] is the average loss (or risk) over any distribution D on X between two hypotheses h, h H. Rn(H) = E {x i}i P i=1 σih(x i) , is the expected Rademacher complex- ity of H, with U the uniform distribution on { 1, 1}. f Q : X Y is the source labeling function. f P : X Y is the target labeling function. 
K > 0 is the number of queries TK T with |TK| = K is a queried batch or subset. LK = S TK is the labeled data set. b Q, b QK and b P are the respective empirical distributions on X of S, S TK and T . HK ϵ = {h H; L(h(x), f Q(x)) ϵ x LK} is the localized hypothesis space . disc HK ϵ ( b QK, b P) = maxh,h HK ϵ |L b QK(h, h ) L b P (h, h )| is the localized discrepancy between b QK and b P. A PROOF OF PROPOSITION 1 Proposition 1. Let K > 0 be the number of queries and H a hypothesis space. Let b P and b QK be the empirical distributions of the respective sets T and LK = S TK of respective size n and m + K. We assume that L is a symmetric, µ-Lipschitz and bounded loss function verifying the triangular inequality. We denote by M the bound of L. Let ηH be the ideal maximal error on S T : ηH = min h H max x S T [L(h(x), f Q(x)) + L(h(x), f P (x))] (8) Then, for any ϵ ηH, any hypothesis h HK ϵ and any δ > 0, the following generalization bound holds with at least probability 1-δ: LP (h, f P ) L b QK(h, f Q) + disc HK ϵ ( b QK, b P) + ηH + 2µRn(H) + M Proof. Let s consider h H. According to (Mohri et al., 2018) we have for any δ > 0, with probability at least 1 δ: LP (h, f P ) L b P (h, f P ) + 2µRn(H) + M δ ) 2n . (10) Besides, we have for any h, h0 HK ϵ : Published as a conference paper at ICLR 2022 L b P (h, f P ) = L b QK(h, f Q) + L b P (h, f P ) L b QK(h, f Q) L b QK(h, f Q) + L b P (h, h0) + L b P (h0, f P ) L b QK(h, h0) + L b QK(h0, f Q) L b QK(h, f Q) + max h,h HK ϵ |L b P (h, h ) L b QK(h, h )| + L b P (h0, f P ) + L b QK(h0, f Q) As the inequality is true for any h0 HK ϵ , we have in particular: L b P (h, f P ) L b QK(h, f Q) + max h,h HK ϵ |L b P (h, h ) L b QK(h, h )| + min h0 HK ϵ L b P (h0, f P ) + L b QK(h0, f Q) We notice that, for any h0 HK ϵ : L b P (h0, f P ) + L b QK(h0, f Q) max x S T [L(h0(x), f Q(x)) + L(h0(x), f P (x))] (13) From which we deduce that, min h0 HK ϵ L b P (h0, f) + L b QK(h0, f) ηHK ϵ (14) Let s now consider h H, such that: h = arg min h H max x S T [L(h(x), f Q(x)) + L(h(x), f P (x))] (15) By assumption we have ηH ϵ, and thus: max x S T [L(h (x), f Q(x)) + L(h (x), f P (x))] ϵ (16) This implies that for any x S T : L(h (x), f Q(x)) ϵ (17) In particular, L(h (x), f Q(x)) ϵ for any x LK, which implies that h is in HK ϵ . We then deduce that: ηH = ηHK ϵ (18) Thus we conclude that for any h HK ϵ and any δ > 0, we have with probability at least 1 δ: LP (h, f P ) L b QK(h, f Q) + disc HK ϵ ( b QK, b P) + ηH + 2µRn(H) + M Published as a conference paper at ICLR 2022 B PROOF OF THEOREM 1 Theorem 1. Let K > 0 be the number of queries, H a hypothesis space of k-Lipschitz functions and ϵ ηH. Let LK = S TK be the labeled set and T the target set drawn according to P. We assume that L is a symmetric, µ-Lipschitz and bounded loss function verifying the triangular inequality. We denote by M the bound of L. For any hypothesis h HK ϵ and any δ > 0, the following generalization bound holds with at least probability 1-δ: LP (h, f) L b QK(h, f Q) + 2kµ x T d(x , LK) + 2ϵ + ηH + 2µRn(H) + δ ) 2n (20) With d(x , LK) = minx LK d(x , x) Proof. Let ϵ ηH and 0 K n. For all h, h HK ϵ , for all x T and for all x LK we have: L(h(x ), h (x )) L(h(x ), h(x)) + L(h(x), h (x)) + L(h (x), h (x )) L(h(x ), h(x)) + L(h (x), h (x )) + L(h(x), f Q(x)) + L(f Q(x), h (x)) µ (|h(x) h(x )| + |h (x) h (x )|) + 2ϵ 2kµ d(x , x) + 2ϵ (21) The two first inequalities come from the triangular inequality, the others from the lipschitzness of h, h and definition of HK ϵ . 
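Restated in LaTeX for readability (same notation and assumptions as above), the chain of inequalities (21) reads:

```latex
L\big(h(x'), h'(x')\big)
  \;\le\; L\big(h(x'), h(x)\big) + L\big(h(x), h'(x)\big) + L\big(h'(x), h'(x')\big) \\
  \;\le\; L\big(h(x'), h(x)\big) + L\big(h'(x), h'(x')\big)
          + L\big(h(x), f_Q(x)\big) + L\big(f_Q(x), h'(x)\big) \\
  \;\le\; \mu\,\big(|h(x) - h(x')| + |h'(x) - h'(x')|\big) + 2\epsilon
  \;\le\; 2k\mu\, d(x', x) + 2\epsilon \tag{21}
```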
As the above inequality is true for any x LK, we have in particular for any x T : L(h(x ), h (x )) 2kµ min x LK d(x, x ) + 2ϵ = 2kµ d(x , LK) + 2ϵ (22) Leading to: L b P (h(x ), h (x )) 2kµ x T d(x , LK) + 2ϵ (23) We then deduce the following, for all h, h HK ϵ : disc HK ϵ ( b QK, b P) = max h,h HK ϵ |L b P (h, h ) L b QK(h, h )| max max h,h HK ϵ L b P (h, h ), max h,h HK ϵ L b QK(h, h ) x T d(x , LK) + 2ϵ, 2ϵ x T d(x , LK) + 2ϵ Finally, according to Proposition 1, we have for all h, h HK ϵ : LP (h, f) L b QK(h, f Q) + 2kµ x T d(x , LK) + 2ϵ + ηH + 2µRn(H) + δ ) 2n (25) Published as a conference paper at ICLR 2022 C APPROXIMATION ERROR BETWEEN PROPOSITION 1 AND THEOREM 1 We present in this section the approximation error of the relaxation between the bounds of Proposition 1 and Theorem 1. We will show that, with the assumption of Theorem 1 and in the case L = L1, we have, for any labeled set LK: x T d(x , LK) k k kf disc HK ϵ ( b QK, b P) (26) With kf the Lipschitz constant of the source labeling function f Q. We assume that k > kf. For this purpose we will look for two hypotheses h, h HK ϵ verifying: L b P (h, h ) 2(k kf) x T d(x , LK) (27) Let s define k = k kf, and h, h : X Y such that, for any x X: h(x) = f Q(x) + k min x LK d(x, x) = f Q(x) + k d(x, LK) h (x) = f Q(x) k min x LK d(x, x) = f Q(x) k d(x, LK) (28) We will now prove that h, h are in HK ϵ : Let s consider x1, x2 X, we define: x1 = arg min x LK d( x, x1) x2 = arg min x LK d( x, x2) (29) Assuming without restriction that d(x1, x1) > d(x2, x2), we have: |h(x1) h(x2)| k |d(x1, LK) d(x2, LK)| + kf d(x1, x2) k (d(x1, x1) d(x2, x2)) + kf d(x1, x2) k (d(x1, x2) d(x2, x2)) + kf d(x1, x2) k d(x1, x2) + kf d(x1, x2) k d(x1, x2) Using the triangular inequality and the fact that d(x1, x1) d(x1, x2) by definition of x1. Using a similar development we can prove the k-Lipschitzness of h . Let s now consider x LK, we have L(h(x), f Q(x)) = L(f Q(x) + k d(x, LK), f Q(x)) = L(f Q(x), f Q(x)) = 0 (31) In the same way L(h (x), f Q(x)) = 0 and we have h, h HK ϵ . Published as a conference paper at ICLR 2022 Furthermore, we have: L b P (h, h ) = 1 x T L(h(x ), h (x )) x T |h(x ) h (x )| x T d(x , LK) disc HK ϵ ( b QK, b P) |L b P (h, h ) 0| = 2(k kf) x T d(x , LK) (33) From which we conclude. D COMPARISON WITH OTHER ACTIVE LEARNING BOUNDS (CF SECTION 4.2) In this section we assume that f = f P = f Q and f H with H a set of k-Lipschitz functions. D.1 K-CENTER BOUNDS Sener and Savarese (Sener & Savarese, 2018) propose the following generalization bounds for ϵ = 0 and h HK ϵ : L b P (h, f) δ(λl + MCλµ) + M which leads to: LP (h, f) δ(λl + MCλµ) + 2µRn(H) + 2M With δ = maxx T d(x , LK), C the class number and λµ the Lipschitz constant of a class-specific regression function. λl is the Lipschitz constant of the loss function l verifying l : (x, h) l(x, f(x), h) = L(h(x), f(x)). If we consider a regression problem, we can drop the term corresponding to the class-specific function and we have: LP (h, f) δλl + 2µRn(H) + O Let s now consider h H and x, x X, we have for k-Lipschitz f : Published as a conference paper at ICLR 2022 |l(x, f(x), h) l(x , f(x ), h)| = |L(h(x), f(x)) L(h(x ), f(x ))| |L(h(x), f(x)) L(h(x ), f(x))| + |L(h(x ), f(x)) L(h(x ), f(x ))| L(h(x), h(x )) + L(f(x), f(x )) µ|h(x) h(x )| + µ|f(x) f(x )| 2µk|x x | (37) The two first inequalities are obtained with triangular inequalities, then we use Lipschitz assumptions on h, f and L. 
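For readability, the pair of hypotheses constructed in (28) is, in LaTeX notation with k' = k - k_f:

```latex
h(x)  \;=\; f_Q(x) + k'\, d(x, \mathcal{L}_K) \;=\; f_Q(x) + k' \min_{\tilde{x} \in \mathcal{L}_K} d(\tilde{x}, x),
\qquad
h'(x) \;=\; f_Q(x) - k'\, d(x, \mathcal{L}_K),
\qquad k' = k - k_f .
```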
Thus, we have λl = 2µk from which we deduce: LP (h, f) 2kµ max x T d(x , LK) + 2µRn(H) + O D.2 WASSERSTEIN BOUNDS To adapt the bound from (Shui et al., 2020) Corollary 1 to our setting, we identify the distributions b D and b Q from Shui et al. (2020) with respectively the distributions b P and b QK. We then have the following generalization bound: LP (h, f) L b QK(h, f) + 2µk W1( b QK, b P) + 2µRn(H) + O Notice that the term corresponding to the labeling function decay property in the bound of Shui et al. (2020) is null when considering a Lipschitz labeling function (f H). Thus for ϵ = 0 and for any h HK ϵ , we have: LP (h, f) 2kµW1( b QK, b P) + 2µRn(H) + O x LK γ x x d(x , x) + 2µRn(H) + O With γ = arg minγ Γ P x LK γx xd(x , x) and Γ = {γ Rn (m+K) ; γ1 = 1 n1 ; γT 1 = 1 m+K 1}. Thus, in observing that: X x LK γx xd(x , x) X x LK γx x min x LK d(x , x) x T d(x , LK) X x T d(x , LK) We deduce that our presented bound of Theorem 1 is also tighter than the one proposed in (Shui et al., 2020) in the case of ϵ = 0 and k-Lipschitz f and h. Published as a conference paper at ICLR 2022 E ALGORITHMS (CF SECTION 3) E.1 K-MEDOIDS Algorithm 2 K-Medoids Greedy 1: Input: T , B, K, S, {d(x , S)}x T ) 2: Output: TK = {xj}j K T 3: T {x s1, ..., x s B} with {x s1, ..., x s B} picked randomly in T without replacement. 4: Initialize query subset: T0 = {} 5: For all x T , d x = d(x , S) = minx S d(x , x) 6: For all x, x T compute d xx = d(x, x ) 7: for i from 1 to K do 8: xi arg minx T P x T min d xx , d x 9: Si Si 1 {xi} 10: For all x T , update d x = min(d x , d xix )) 11: end for Algorithm 3 Branch & Bound Medoid (B & B) 1: Input: Cluster C Rnc p, previous medoid criterion C , batch size B 2: Output: New medoid x 3: Initialize candidates C = C 4: Initialize computed distance set Dx = {} for all x C. 5: Initialize criterion Cx = 0 and standard deviation σx = 0 for all x C. 6: Initialize threshold t = C 7: for i from 1 to nc/B do 8: Ci = {xj C ; (B 1)i j Bi} 9: for x C do 10: Dx Dx {d(x, x ) ; x Ci} 11: Cx 1 Bi P d Dx d d Dx(d Cx)2 13: end for 14: x arg minx C Cx 15: t min(t, Cx + 2σx 16: C n x C ; Cx 2σx 17: end for 18: x arg minx C Cx E.2 APPROXIMATION BOUND FOR THE GREEDY ALGORITHM This section is dedicated to the proof of the bound for the greedy K-medoids algorithm which is expressed as follows: The gain of selecting K new medoids with the greedy algorithm is an (1 1/e)-approximation of the optimal gain. Let TK be a batch of K target points selected with the greedy algorithm. The gain is the difference between the initial objective and the final objective after selecting TK: gain(TK) = X x T d(x , S) X x T d(x , S TK) (41) With d(x , S) = minx S d(x , x) Published as a conference paper at ICLR 2022 We will prove now that the gain is monotone submodular. Let A B T and x T \ B, We denote T A = {x T ; d(x , x) < d(x , A)} and T B = {x T ; d(x , x) < d(x , B)}. As A B, we have d(x , A) d(x , B) for any x T . Thus, T B T A (42) Besides, for any x T , d(x , A) d(x , x) d(x , B) d(x , x) (43) We deduce then that, gain(A {x}) gain(A) = X x T A d(x , A) d(x , x) x T B d(x , A) d(x , x) x T B d(x , B) d(x , x) = gain(B {x}) gain(B) Then, the gain is submodular. Besides, as A B it appears clearly that, gain(A) gain(B) (45) and the gain is monotone. Finally, by noticing that the gain is always positive and according to (Nemhauser et al., 1978), we conclude that the gain of selecting K new medoids with the greedy algorithm is an (1 1/e)- approximation of the optimal gain. 
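A compact Python sketch of the mini-batch medoid update of Algorithm 3 is given below for concreteness. It is a simplification under assumptions: the per-candidate criterion is estimated on growing random batches, candidates whose optimistic bound exceeds the running threshold are discarded, and the constants of the statistical threshold do not exactly follow the analysis of Section E.3.

```python
import numpy as np
from scipy.spatial.distance import cdist

def branch_and_bound_medoid(C, prev_criterion=np.inf, batch_size=128, seed=0):
    """Return the point of cluster C (shape (nc, p)) approximately minimizing
    the mean distance to the cluster, computing as few pairwise distances as
    possible (sketch of Algorithm 3)."""
    rng = np.random.default_rng(seed)
    nc = len(C)
    candidates = np.arange(nc)              # surviving medoid candidates
    sums = np.zeros(nc)                     # accumulated distance sums
    sq_sums = np.zeros(nc)                  # accumulated squared distances
    seen = 0
    threshold = prev_criterion              # previous medoid criterion (mean scale), or +inf
    order = rng.permutation(nc)
    for start in range(0, nc, batch_size):
        batch = C[order[start:start + batch_size]]
        d = cdist(C[candidates], batch)     # distances for surviving candidates only
        sums[candidates] += d.sum(axis=1)
        sq_sums[candidates] += (d ** 2).sum(axis=1)
        seen += len(batch)
        mean = sums[candidates] / seen
        std = np.sqrt(np.maximum(sq_sums[candidates] / seen - mean ** 2, 0.0))
        dev = 2.0 * std / np.sqrt(seen)     # statistical deviation, ±2σ/√n as in Algorithm 3
        threshold = min(threshold, float((mean + dev).min()))
        candidates = candidates[mean - dev <= threshold]   # reject provably worse candidates
        if len(candidates) == 1:
            break
    return C[candidates[np.argmin(sums[candidates])]]
```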
E.3 COMPLEXITY COMPUTATION

In the following, a distance computation is considered to be done in $O(p)$.

1. KD-Trees Random Forest: each of the $T$ trees is built by splitting the samples, at each node, at the median of a random feature, until the leaves contain $\log(m)$ samples. The median computation for a node with $m_r$ data points is in $O(m_r \log(m_r))$. Thus the overall complexity to build one tree is $O\big(\sum_{i=0}^{M} 2^i \frac{m}{2^i} \log(\frac{m}{2^i})\big) = O\big(m \log(m) \sum_{i=0}^{M} 1\big)$ with $M \le \log(m / \log(m))$, which becomes $O(m \log(m)^2)$. Then, each target point is assigned to a leaf by performing $O(\log(m))$ comparisons, and inside the assigned leaf all distance computations are done in $O(p \log(m))$. Thus an approximate nearest neighbour is obtained for all target points with a complexity of $O(T (m + p n) \log(m)^2)$.

2. Medoids Initialization: using the greedy algorithm, $O(p B^2)$ distance computations are first done; then, for all targets, a sum over the target data is computed at each of the $K$ steps. Thus the complexity is $O((K + p) B^2)$.

3. Assignment to the Closest Medoid: the distances between all target points and the $K$ medoids are computed in $O(K p n)$.

4. Branch & Bound Medoid Computation:

The B & B algorithm (Algorithm 3) takes as input one cluster $C$ of $n_c$ unlabeled data points from $T$. It also takes a batch size $B$ and the previous cluster-medoid criterion $C^*$, which is used as an initial threshold. The use of the initialization $C^*$ may accelerate the algorithm; in the following we do not take this initialization into account, i.e. we consider that $C^* = +\infty$. Besides, to use notations consistent with common statistical notations, we denote the batch size $B$ by $n$ ($B \equiv n$). Notice that this notation is redundant with the size of the unlabeled data set $T$; an explicit mention will be made when $n$ does not refer to the batch size.

Definitions and notations: let us consider one cluster $C \subset T$ with $n_c$ data points. We consider the uniform norm as the underlying distance $d$, defined for all $x_i, x_j \in C$ as $d(x_i, x_j) = \max\big(|x_i^{(1)} - x_j^{(1)}|, ..., |x_i^{(p)} - x_j^{(p)}|\big)$ with $x_i = (x_i^{(1)}, ..., x_i^{(p)}) \in \mathbb{R}^p$. Computing the complexity of the B & B algorithm for an arbitrary distribution of the $x_i$ would be too difficult; we make the simplifying assumption that the $x_i$ in $C$ are uniformly distributed on the hyper-cube $\mathcal{C}$ of edge size 2 centered on $x^* = (0, ..., 0) \in \mathbb{R}^p$.

We define, for any $i \in [\![1, n_c]\!]$ and any $j \in [\![1, n_c]\!]$, the variables $Z_j^i = d(x_i, x_j)$. We also define $Z_j^* = d(x^*, x_j)$ for any $j \in [\![1, n_c]\!]$. We suppose that for any $i \in [\![1, n_c]\!]$ the $Z_1^i, ..., Z_{n_c}^i$ are iid and that for any $j \in [\![1, n_c]\!]$ the $Z_j^i$ are independent. We define, for any $i \in [\![1, n_c]\!]$, the mean $\mu_i = \mathbb{E}[Z_0^i] = \frac{1}{2^p} \int_{x \in \mathcal{C}} d(x_i, x)\,dx$ and the variance $\sigma_i^2 = \mathrm{Var}[Z_0^i] = \frac{1}{2^p} \int_{x \in \mathcal{C}} (d(x_i, x) - \mu_i)^2\,dx$. We consider a first batch of distance computations of size $n \le n_c$. We define, for any $i \in [\![1, n_c]\!]$, the empirical mean $\hat\mu_i = \frac{1}{n} \sum_{j=1}^{n} Z_j^i$ and the empirical variance $\hat\sigma_i^2 = \frac{1}{n-1} \sum_{j=1}^{n} (Z_j^i - \hat\mu_i)^2$. We denote by $\mu^*$ and $\sigma^{*2}$ the mean and variance of $Z_0^*$, and by $\hat\mu^*$ and $\hat\sigma^*$ their respective empirical estimators.

We first observe that $\mu_i$ and $\sigma_i$ are finite for any $i \in [\![1, n_c]\!]$, as the $x_i$ are uniformly distributed on the hyper-cube centered on $x^* \in \mathcal{C}$. We also notice that $Z_j^i \in [0, 2]$ for any $i, j \in [\![1, n_c]\!]$.

Preliminary results: we assume that $p \gg 1$ and we aim at bounding the $\mu_i$ and $\sigma_i$. We admit the intuitive result that, for any $x_i = (x_i^{(1)}, ..., x_i^{(p)}) \in C$ and any coordinate $j \in [\![1, p]\!]$, $\mu_i \ge \mu_i^{j \to 0}$ and $\sigma_i \ge \sigma_i^{j \to 0}$, with $\mu_i^{j \to 0}$ and $\sigma_i^{j \to 0}$ the mean and standard deviation of the variable $d(x_i^{j \to 0}, \cdot)$, where $x_i^{j \to 0} = (x_i^{(1)}, ..., x_i^{(j-1)}, 0, x_i^{(j+1)}, ..., x_i^{(p)})$: the closer $x_i$ is to the center of the hyper-cube, the smaller $\mu_i$ and $\sigma_i$.
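Before turning to the calculation, here is a minimal sketch of the elimination loop under study. It is our own NumPy rendering of Algorithm 3, not the authors' code; in particular, the $\pm 2\sigma_x / \sqrt{Bi}$ confidence band and the random batch order are our reading of the algorithm and of the analysis that follows.

```python
import numpy as np

def bb_medoid(C, B, prev_criterion=np.inf, seed=0):
    """Sketch of Algorithm 3 (Branch & Bound medoid search) for one cluster.

    C: (nc, p) array of points of the cluster.
    B: batch size (number of reference points added at each round).
    prev_criterion: criterion of the previous medoid, used as initial threshold t.
    Returns the index in C of the retained medoid candidate.
    """
    nc = len(C)
    order = np.random.default_rng(seed).permutation(nc)  # reference points consumed batch by batch
    candidates = np.arange(nc)     # surviving candidate indices
    sums = np.zeros(nc)            # running sums of computed distances, per candidate
    sq_sums = np.zeros(nc)         # running sums of squared distances
    t = prev_criterion
    n_seen = 0

    for start in range(0, nc, B):
        batch = C[order[start:start + B]]
        n_seen += len(batch)
        # Euclidean distance here for simplicity; the analysis of this section uses the sup norm.
        dists = np.linalg.norm(C[candidates][:, None, :] - batch[None, :, :], axis=-1)
        sums[candidates] += dists.sum(axis=1)
        sq_sums[candidates] += (dists ** 2).sum(axis=1)

        mean = sums[candidates] / n_seen                      # criterion C_x
        std = np.sqrt(np.maximum(sq_sums[candidates] / n_seen - mean ** 2, 0.0))
        half = 2.0 * std / np.sqrt(n_seen)                    # 2 sigma_x / sqrt(Bi)

        best = int(np.argmin(mean))
        t = min(t, mean[best] + half[best])                   # tighten the threshold
        candidates = candidates[mean - half <= t]             # prune hopeless candidates
        if len(candidates) == 1:
            break

    return int(candidates[np.argmin(sums[candidates] / n_seen)])
```

Passing the criterion of the previous medoid as prev_criterion reproduces the warm-started threshold of line 6 of Algorithm 3; the analysis below ignores it and takes $t = +\infty$.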
Considering this monotonicity, we have, for any $i \in [\![1, n_c]\!]$:
$$\mu^* \le \mu_i \quad (46)$$
$$\sigma^* \le \sigma_i \quad (47)$$
Besides, for any $0 \le r \le 1$, the sample density in the elementary volume between the balls centered on $x^*$ of respective radii $r + dr$ and $r$ is $p r^{p-1} dr$. Thus, $Z_0^*$ follows a Beta distribution with parameters $\alpha = p$ and $\beta = 1$, from which we deduce that:
$$\mu^* = \frac{p}{p+1} \quad (48)$$
$$\sigma^{*2} = \frac{p}{(p+1)^2 (p+2)} \quad (49)$$
We further notice that, for any $0 \le a \le 1$ and any $i \in [\![1, n_c]\!]$ such that $d(x^*, x_i) = a$, we have $\mu_i \ge \mu_a$, with $\mu_a$ the criterion of the sample $x_a = (a, 0, ..., 0) \in \mathbb{R}^p$. To compute $\mu_a = \frac{1}{2^p} \int_{x \in \mathcal{C}} d(x_a, x)\,dx$, we split the integral into three parts, according to whether $d(x_a, x) \le 1 - a$, $1 - a \le d(x_a, x) \le 1$ or $1 \le d(x_a, x) \le 1 + a$:
$$\mu_a = \int_{0}^{1-a} p\, r^{p}\,dr + \frac{1}{2} \int_{1-a}^{1} r \big(p\, r^{p-1} + (p-1)(1-a)\, r^{p-2}\big)\,dr + \frac{1}{2} \int_{1}^{1+a} r\,dr \quad (50)$$
Using the simplifying approximation $\frac{p}{p+1} \approx \frac{p-1}{p} \approx 1$ for $p \gg 1$ (and neglecting the $(1-a)^{p+1}$ terms), we obtain, for any $0 \le a \le 1$ and any $i \in [\![1, n_c]\!]$ such that $d(x^*, x_i) = a$:
$$\mu_i \ge \mu_a \approx 1 + \frac{a^2}{4} \quad (51)$$
An upper bound on the $\sigma_i^2$ is given by the variance $\sigma_c^2$ of the variable $d(x_c, \cdot)$ with $x_c = (1, ..., 1)$, which corresponds to a corner of the hyper-cube $\mathcal{C}$. Since $d(x_c, \cdot)/2$ follows the same Beta$(p, 1)$ distribution as $Z_0^*$:
$$\sigma_c^2 = 4\, \frac{p}{p+2} - 4 \Big(\frac{p}{p+1}\Big)^2 = \frac{4p}{(p+1)^2 (p+2)} = 4\, \sigma^{*2} \quad (52)$$
From which we deduce that, for any $i \in [\![1, n_c]\!]$:
$$\sigma^* \le \sigma_i \le 2 \sigma^* \quad (53)$$
To simplify the calculations, we make the approximation $\hat\sigma_i \approx \sigma_i$ for any $i \in [\![1, n_c]\!]$, which is relevant for sufficiently large $n$ as $\mathbb{E}[(Z_0^i)^4] < +\infty$.

Probability of rejecting all optimal medoid candidates: let $\epsilon > 0$ be an approximation factor for $\mu^*$. The goal of the Branch & Bound algorithm is to identify one sample $x_i \in C$ such that $\mu_i \le \mu^* (1 + \epsilon)$ with as few distance computations as possible. The process consists in removing all candidates $x_i$ such that $\hat\mu_i - \frac{2 \hat\sigma_i}{\sqrt{n}} > \hat\mu_{i^*} + \frac{2 \hat\sigma_{i^*}}{\sqrt{n}}$, with $\hat\mu_{i^*}$ the current minimal empirical mean. We now aim at computing the probability of rejecting all optimal medoid candidates, i.e. those verifying $\mu_i \le \mu^* (1 + \epsilon)$, during the B & B process. For this, we define $B_\epsilon(\mu^*) = \{i \in [\![1, n_c]\!] \;;\; \mu_i \le \mu^* (1 + \epsilon)\}$ the index set of optimal medoid candidates and $B_\epsilon^c(\mu^*) = \{i \in [\![1, n_c]\!] \;;\; \mu_i > \mu^* (1 + \epsilon)\}$ the index set of sub-optimal medoid candidates. We assume that B & B returns an optimal candidate if at least one sample of $B_\epsilon(\mu^*)$ is kept after the first batch computation. We define the two following probabilities:
$$P_1 = P\Big(\exists\, i \in B_\epsilon^c(\mu^*) \;;\; \hat\mu_i + \frac{4 \sigma^*}{\sqrt{n}} \le \mu^* \big(1 + \tfrac{\epsilon}{4}\big)\Big) \quad (54)$$
$$P_2 = P\Big(\exists\, i \in B_\epsilon(\mu^*) \;;\; \hat\mu_i \le \mu^* \big(1 + \tfrac{\epsilon}{4}\big)\Big) \quad (55)$$
We can observe that the probability of rejecting all optimal medoid candidates is upper bounded by $(1 - P_2) + P_2 P_1$, considering the approximation $\hat\sigma_i \approx \sigma_i$ and the fact that $\sigma_i \ge \sigma^*$ for any $i \in [\![1, n_c]\!]$. We now define, for $i \in B_\epsilon^c(\mu^*)$:
$$P_i = P\Big(\hat\mu_i + \frac{4 \sigma^*}{\sqrt{n}} \le \mu^* \big(1 + \tfrac{\epsilon}{4}\big)\Big) \quad (56)$$
On the other hand, according to Bennett's inequality (Maurer & Pontil, 2009; Hoeffding, 1994), applied to the $Z_j^i / 2 \in [0, 1]$, we have for any $i \in B_\epsilon^c(\mu^*)$ and any $\delta > 0$, with probability at least $1 - \delta$:
$$\mu_i - \hat\mu_i \le \sqrt{\frac{2 \sigma_i^2 \log(1/\delta)}{n}} + \frac{2 \log(1/\delta)}{3n}$$
Then, considering the fact that $\sigma_i \le 2 \sigma^*$ and that $\mu_i > \mu^* (1 + \epsilon)$, we have, with probability at least $1 - \delta$:
$$\hat\mu_i > \mu^* (1 + \epsilon) - 2 \sigma^* \sqrt{\frac{2 \log(1/\delta)}{n}} - \frac{2 \log(1/\delta)}{3n}$$
Leading to $P_i \le \delta$ as soon as:
$$2 \sigma^* \sqrt{\frac{2 \log(1/\delta)}{n}} + \frac{2 \log(1/\delta)}{3n} \le \frac{3 \epsilon \mu^*}{4} + \frac{4 \sigma^*}{\sqrt{n}}$$
Let us consider the $\delta > 0$ such that the above inequality holds with equality. Setting $u = \sqrt{\log(1/\delta)}$, this equality takes the form of a quadratic equation
$$u^2 + A u = B \quad (62)$$
whose positive root gives
$$\sqrt{\log(1/\delta)} = \frac{-A + \sqrt{A^2 + 4B}}{2} \quad (68)$$
For the $\delta$ defined above and for any $i \in B_\epsilon^c(\mu^*)$ we thus have $P_i \le \delta$, and the probability $P_1$ can be upper bounded as follows:
$$P_1 \le 1 - (1 - \delta)^{n_c} \quad (70)$$
considering the fact that $|B_\epsilon^c(\mu^*)| < n_c$; $1 - (1 - \delta)^{n_c}$ is the probability of getting at least one success under a binomial law with parameters $(n_c, \delta)$.
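The preliminary results above rest on $Z_0^*$ following a Beta$(p, 1)$ distribution. A quick Monte Carlo check of the moments (48) and (49) — a sketch of ours, not part of the paper:

```python
import numpy as np

p, n_samples = 100, 100_000
rng = np.random.default_rng(0)

# Uniform points on the hyper-cube of edge size 2 centered at x* = 0.
X = rng.uniform(-1.0, 1.0, size=(n_samples, p))
Z = np.abs(X).max(axis=1)   # Z*_0 = sup-norm distance to the center, ~ Beta(p, 1)

print(Z.mean(), p / (p + 1))                      # both ~ 0.990
print(Z.var(), p / ((p + 1) ** 2 * (p + 2)))      # both ~ 9.6e-5
```

Replacing the center by a corner point $x_c = (1, ..., 1)$ and dividing the distances by 2 recovers the same Beta$(p, 1)$ law, which is exactly the $\sigma_c = 2\sigma^*$ argument used for (53).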
We now look for an upper bound on $P_2$. We observe that at least the index $i$ such that $x_i = x^*$ belongs to $B_\epsilon(\mu^*)$. Besides, as $x^*$ is the center of the hyper-cube, $Z_j^* \in [0, 1]$ for any $j \in [\![1, n_c]\!]$. We can then apply Bennett's inequality to the $Z_j^*$: for any $\gamma > 0$, with probability at least $1 - \gamma$,
$$\hat\mu^* - \mu^* \le \sqrt{\frac{2 \sigma^{*2} \log(1/\gamma)}{n}} + \frac{\log(1/\gamma)}{3n}$$
Thus, for any $\gamma > 0$ such that $\sqrt{\frac{2 \sigma^{*2} \log(1/\gamma)}{n}} + \frac{\log(1/\gamma)}{3n} \le \frac{\epsilon \mu^*}{4}$, we then have:
$$P_2 \ge P\Big(\hat\mu^* \le \mu^* \big(1 + \tfrac{\epsilon}{4}\big)\Big) \ge 1 - \gamma \quad (75)$$
Leading to:
$$1 - P_2 \le \gamma \quad (76)$$
Finally, the probability of rejecting all optimal candidates is upper bounded by $1 - (1 - \delta)^{n_c} + \gamma$. To give an order of magnitude of this probability, we consider the scenario where $n_c = 10^5$, $p = 100$, $n = \sqrt{n_c}$ and $\epsilon = 0.05$. We then have $\delta \approx 3.6 \cdot 10^{-8}$ and $\gamma \approx 7.7 \cdot 10^{-5}$, which leads to a probability of rejection of around $3.7 \cdot 10^{-3}$. Thus, in this case, B & B returns a medoid candidate with a criterion at most $1.05$ times the optimal one with probability at least $0.995$.

Complexity computation: we now evaluate the number of distance computations performed by B & B. For this, we need to bound the number of candidates $x_i$ kept after each batch. An upper bound on this number is given by the number of $x_i$ verifying $\hat\mu_i \le \mu^* (1 + \frac{\epsilon}{4}) + \frac{8 \sigma^*}{\sqrt{n}}$. We further assume that this quantity can be approximated by the number of $x_i$ verifying $\mu_i \le \mu^* (1 + \frac{\epsilon}{4}) + \frac{8 \sigma^*}{\sqrt{n}}$. We have shown in the preliminary results that, for any $i \in [\![1, n_c]\!]$, $\mu_i \gtrsim 1 + \frac{a^2}{4}$ with $a = d(x_i, x^*)$. Thus an upper bound on the number of candidates kept after the first batch is given by the number of $x_i$ lying in the ball of radius $a$, with $a$ verifying:
$$1 + \frac{a^2}{4} \le \mu^* \Big(1 + \frac{\epsilon}{4}\Big) + \frac{8 \sigma^*}{\sqrt{n}}$$
using the approximation $\mu^* = \frac{p}{p+1} \approx 1$. Besides, as $\sigma^* = O(1/p)$, we can suppose that, for sufficiently large $p$ and $n$, $\frac{8 \sigma^*}{\sqrt{n}}$ is small compared to $\frac{3\epsilon}{4}$ (for instance, with the values of $n_c$, $n$, $p$ and $\epsilon$ considered previously, we have $\frac{8 \sigma^*}{\sqrt{n}} \approx 0.005$, well below $\frac{3\epsilon}{4}$). Thus:
$$a \lesssim \sqrt{\epsilon}$$
Finally, the number of candidates kept after the first batch is in $O(n_c\, \epsilon^{p/2})$, which is very small (for the values of $n_c$, $p$ and $\epsilon$ considered previously, $n_c\, \epsilon^{p/2} \approx 10^{-60}$). If we consider a batch size of $O(\sqrt{n_c})$, the number of distance computations after the first batch is therefore negligible compared to $O(n_c \sqrt{n_c})$. We then conclude that the complexity of a medoid computation in one cluster is in $O(p\, n_c \sqrt{n_c})$; as each of the $K$ clusters contains approximately $n_c \approx n / K$ samples (with $n$ here the sample size of $T$), the overall complexity of B & B is in $O(p\, n^{3/2} K^{-1/2})$.
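The orders of magnitude quoted in this analysis are easy to reproduce; the following sketch (ours) simply plugs in the values used above:

```python
import numpy as np

nc, p, eps = 10 ** 5, 100, 0.05
n = int(np.sqrt(nc))                       # size of the first batch, n = sqrt(nc)
delta, gamma = 3.6e-8, 7.7e-5              # values quoted in the analysis above

# Probability of rejecting every near-optimal candidate: 1 - (1 - delta)^nc + gamma.
print(1 - (1 - delta) ** nc + gamma)       # ~3.7e-3, i.e. success probability > 0.995

# Number of candidates surviving the first batch: O(nc * eps^(p/2)).
print(nc * eps ** (p / 2))                 # ~1e-60, essentially no candidate survives

# Distance computations for one medoid: O(p * nc * sqrt(nc)).
print(p * nc * np.sqrt(nc))                # ~3e9 elementary operations
```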
F EMPIRICAL COMPLEXITIES

This section presents the empirical computation times recorded for K-medoids Greedy, Accelerated K-medoids, K-centers and K-centers + KD-Trees, the latter corresponding to the K-centers algorithm initialized with the nearest-source-neighbour distances computed by the KD-Trees Random Forest algorithm. The experiments are conducted on the Digits data set and run on a (2.7GHz, 16GB RAM) computer using Python 3.8. The scikit-learn (Pedregosa et al., 2011) implementation of the pairwise euclidean distance is used and no parallel computing is performed. The parameters are set to $K = 100$, $T = 50$ for the KD-Trees Random Forest algorithm and an initial batch size $B = 5000$ for the Accelerated K-medoids algorithm. The euclidean distance is used as base distance $d$. The results are reported in Figure 5, which shows the evolution of the computation time as a function of the source and target sample sizes ($m$ and $n$).

Figure 5: Empirical computation times as a function of the sample sizes.

We first observe in Figure 5 that the K-medoids Greedy algorithm becomes computationally burdensome for samples larger than 10k instances. For $n = m = 60$k, the K-centers algorithm encounters a similar issue, due to the computation of the distance matrix between the source and target data sets. Using the KD-Trees Random Forest algorithm decreases the computation time in this case by a factor of 5. We also notice that the Accelerated K-medoids algorithm has a computational cost similar to that of K-medoids Greedy for $n, m \le 5000$, which is the maximum size of the initial batch. Beyond that, its computation time increases only slightly from $n, m = 5$k to $n, m = 60$k while remaining at an acceptable level. For $n, m = 60$k, the computation times of Accelerated K-medoids and K-centers + KD-Trees are almost similar, but with a level of performance in favor of Accelerated K-medoids (see Section 5.4).

We also present the evolution of the objective function of the different algorithms as a function of the number of queries $K$ for $n = m = 60$k (cf. Figure 6). We observe that the objective of the K-centers algorithm is higher than that of the K-medoids algorithms. We also observe that K-medoids Greedy provides slightly smaller objectives than Accelerated K-medoids.

Figure 6: Empirical objectives as a function of the number of queries for a sample size of 60k.

G EXPERIMENTS

G.1 SUPERCONDUCTIVITY

Setup: the UCI Superconductivity data set (Hamidieh, 2018; Dua & Graff, 2017) is composed of features extracted from the chemical formulas of several superconductors, along with their critical temperature. There are two kinds of features: some correspond to the numbers of the different chemical elements in the superconductor's formula, the others are statistical features derived from the chemical formula, such as the mean and variance of the atomic mass. We use the setup of (Pardoe & Stone, 2010) to divide the data set into separate domains: we select an input feature with a moderate correlation factor with the output ($\approx 0.3$), sort the data set according to this feature and split it into four parts: low (l), middle-low (ml), middle-high (mh) and high (h). Each part defines a domain with around 4000 instances and 166 features. The considered feature is then withdrawn from the data set. A standard-scaling preprocessing, fitted on the source data, is applied to the statistical input features and to the output feature, and a max scaling is applied to the features corresponding to the chemical element numbers. A visualization of the first PCA components as well as of the output distribution is provided in Figure 7.

Figure 7: Visualization of the domain shift on the Superconductivity data set. Left: the two first PCA components of the input features. Right: the output distribution. Each color represents one domain.

Standard deviation: Table 2 presents the standard deviations of the MAEs obtained over the 8 repetitions of the 12 experiments with the Balanced Weighting training and $K = 20$. We observe that K-medoids provides the smallest standard deviations in the majority of the experiments. K-medoids is indeed a deterministic algorithm and thus selects a fixed batch of target points. Besides, K-medoids selects the target batch to label from a distribution-matching perspective, i.e. it produces a training set whose distribution is close to that of the testing set (cf. Section 2). This can explain why training is more stable with the training set provided by K-medoids.
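For completeness, the domain construction described in the setup paragraph above can be sketched as follows; this is our own reading of the protocol, and the file name, the output column name and the correlation-based choice of the split feature are assumptions rather than details taken from the paper. Tables 2 and 3 below then report the corresponding standard deviations and MAEs.

```python
import pandas as pd

def make_domains(df: pd.DataFrame, target_col: str = "critical_temp"):
    """Sort by a moderately correlated feature and split into l / ml / mh / h domains."""
    feats = df.drop(columns=[target_col])
    # Feature whose absolute correlation with the output is closest to 0.3.
    corr = feats.corrwith(df[target_col]).abs()
    split_feat = (corr - 0.3).abs().idxmin()

    df_sorted = df.sort_values(split_feat).drop(columns=[split_feat]).reset_index(drop=True)
    q = len(df_sorted) // 4
    bounds = [0, q, 2 * q, 3 * q, len(df_sorted)]
    names = ["l", "ml", "mh", "h"]
    return {name: df_sorted.iloc[lo:hi] for name, lo, hi in zip(names, bounds[:-1], bounds[1:])}

# Example usage (hypothetical file name):
# from sklearn.preprocessing import StandardScaler
# domains = make_domains(pd.read_csv("superconductivity.csv"))
# scaler = StandardScaler().fit(domains["l"])   # scaling statistics fitted on the source domain only
```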
Table 2: Standard deviations of the MAE on the unlabeled data for the Superconductivity experiments. The deviations are computed using the results of 8 repetitions of each experiment.

Experiment   l→ml   l→mh   l→h    ml→l   ml→mh  ml→h   mh→l   mh→ml  mh→h   h→l    h→ml   h→mh
Random       1.007  2.599  1.597  1.232  1.679  1.291  1.863  0.787  0.266  1.900  1.228  0.595
Kmeans       1.05   0.93   0.78   1.93   0.54   0.64   1.0    1.36   0.58   6.51   1.51   0.59
QBC          1.469  2.370  1.882  0.571  0.564  0.707  2.151  0.767  0.394  7.240  3.250  0.620
Kcenters     0.807  1.761  1.926  0.626  0.343  0.794  2.188  0.885  0.503  5.403  2.838  0.920
Diversity    1.34   1.2    3.28   0.52   0.8    1.86   2.46   0.79   0.42   5.49   1.44   0.32
Kmedoids     0.59   0.76   1.21   0.54   0.55   0.74   0.64   0.33   0.26   1.19   0.8    0.15

Table 3: MAE on the critical temperature for the Superconductivity experiments with Balanced Weighting and K = 20.

Experiment   l→ml   l→mh   l→h    ml→l   ml→mh  ml→h   mh→l   mh→ml  mh→h   h→l    h→ml   h→mh
Random       15.33  15.80  17.45  16.53  11.39  14.70  17.65  12.83  10.36  18.75  14.86  10.54
Kmeans       14.43  13.60  13.98  15.79  10.19  12.73  17.18  12.67  10.02  22.10  14.69  9.76
QBC          20.00  19.03  20.08  15.89  12.24  15.31  20.78  12.87  10.19  31.88  18.86  10.65
Kcenters     19.21  15.73  16.85  15.75  11.62  13.44  22.17  12.74  10.24  36.50  19.60  10.39
Diversity    19.46  18.21  18.68  16.01  11.94  15.36  23.92  14.31  10.70  37.97  20.89  10.78
Kmedoids     12.70  13.57  14.11  14.49  10.02  12.52  15.36  12.37  9.79   16.62  14.14  9.32

Low-to-high versus high-to-low analysis: an interesting fact to notice in the results of the Superconductivity experiments (for which a recap is given in Table 3) is the asymmetry between the low-to-high (l→h) experiment and the high-to-low (h→l) experiment. Indeed, the results of the latter are notably worse than those of the former for all algorithms except Random and K-medoids, whose degradation remains limited. It first appears, in Figure 8, that this difference in performance comes from the difference between the output distributions of the low and high domains: the output distribution of the high domain has a single mode, concentrated at low temperatures, while the output distribution of the low domain has two modes, one at low and one at high temperatures. Thus, a model trained on the low-domain data generalizes better to the high domain, since it has seen the full range of temperatures. In the inverse problem (high-to-low), the model has only seen low temperatures and therefore poorly predicts part of the low domain. For this difficult experiment (high-to-low), the difference in performance between the query algorithms then lies mainly in the number of queried low-domain instances with a high temperature (see Figure 9). It appears that K-medoids and Random query more instances of this type, as these instances are highly concentrated and relatively far from the source data (see Figures 10 and 11).

Figure 8: Visualization of the domain shift of the Superconductivity data set for the high-to-low experiment. Left: the two first PCA components of the input features. Right: the output distribution. Each color represents one domain.

Figure 9: Visualization of the queries in the output space for the high-to-low experiment with K = 20.

Figure 10: Visualization of the queries in the input space for the high-to-low experiment with K = 20.
Figure 11: Visualization of the input space for the high-to-low experiment. The heatmap gives the output labels for the low domain.

Figure 12: Office data set: examples of images from the amazon domain.

Figure 13: Office data set: examples of images from the webcam domain.

Figure 14: Digits data set: examples of SYNTH images.

Figure 15: Digits data set: examples of SVHN images.