# tight_rates_in_supervised_outlier_transfer_learning__affabdbf.pdf Published as a conference paper at ICLR 2024 TIGHT RATES IN SUPERVISED OUTLIER TRANSFER LEARNING Mohammadreza M. Kalan Statistics, Columbia University mm6244@columbia.edu Samory Kpotufe Statistics, Columbia University samory@columbia.edu A critical barrier to learning an accurate decision rule for outlier detection is the scarcity of outlier data. As such, practitioners often turn to the use of similar but imperfect outlier data from which they might transfer information to the target outlier detection task. Despite the recent empirical success of transfer learning approaches in outlier detection, a fundamental understanding of when and how knowledge can be transferred from a source to a target outlier detection task remains elusive. In this work, we adopt the traditional framework of Neyman Pearson classification which formalizes supervised outlier detection with the added assumption that one has access to some related but imperfect outlier data. Our main results are as follows: We first determine the information-theoretic limits of the problem under a measure of discrepancy that extends some existing notions from traditional balanced classification; interestingly, unlike in balanced classification, seemingly very dissimilar sources can provide much information about a target, thus resulting in fast transfer. We then show that, in principle, these information theoretic limits are achievable by adaptive procedures, i.e., procedures with no a priori information on the discrepancy between source and target outlier distributions. 1 INTRODUCTION A primary objective in many data science applications is to learn a decision rule that separates a common class with abundant data from a rare class with limited or no data. This is a traditional problem which often appears under the umbrella term of outlier detection or rare class classification, and has seen a resurgence of interest in modern applications such as malware detection in cybersecurity and Io T (Jose et al., 2018; Kumar & Lim, 2019), fraud detection in credit card transactions (Malini & Pushpa, 2017), disease diagnosis (Bourzac, 2014; Zheng et al., 2011), among others. A main goal in these applications which distinguishes it from traditional classification where performance is asssessed on average over all classes is to achieve low classification error on the rare class, while at the same time maintaining low error w.r.t. the common class. Such a constrained objective is commonly referred to as Neyman-Pearson classification. Formally, letting µ0,µ1 denote the common and rare class distributions, Neyman-Pearson classification takes the form: Minimize µ1-error over classifiers h in some hypothesis space H subject to keeping the µ0-error of such an h under a threshold α. In this work, we focus on the common supervised setting where practitioners have access to not only training data from the common class, but also some (limited amount of) data from the rare class or, pertinently, from a related distribution they hope has information on the rare class. Henceforth, for simplicity of notation, we denote the target rare class distribution by µ1,T and the related but imperfect rare class distribution by µ1,S, where S stands for source. As an example, such related rare-class data may be from a different infected device in Io T applications, or from similar cancer types in medical applications, or from laboratory simulations of a rare class. This is thus a transfer learning problem, however for supervised outlier detection rather than for traditional classification as is usual in the literature. Published as a conference paper at ICLR 2024 Figure 1: µ0 = N(a0, σ2) is the common distribution, and µ1,S, µ1,T = N(a1,S, σ2), N(a1,T , σ2) are the source and target distributions. The universally optimal decision rules for the source and target are identical, i.e., h S,α = h T,α, in fact for any value of α [0, 1] (see Section 3). While this outlier transfer problem is quite common in applications due to the scarcity of rare class data (Su et al., 2023; Wen & Keyes, 2019; Aburakhia et al., 2020), the problem has so far received little rigorous attention. Our main aim is therefore to further theoretical understanding of this timely problem, and in particular to gain much needed insight on the extent to which related but imperfect rare class data may improve performance for the target Neyman-Pearson classification problem. Such achievable transfer performance of course must depend on how far the related rare class distribution µ1,S is from the target µ1,T somehow properly formalized and or whether the related rare-class distribution µ1,S induces similar optimal decision rules as for the target rare class distribution µ1,T . We first argue that, unlike in traditional classification, seemingly very different source and target distributions µ1,S,µ1,T may induce the same exact (universally) optimal decision rules in Neyman Pearson classification; this is obtained as a consequence of a simple extension of the classical Neyman-Pearson Lemma (Lehmann & Lehmann, 1986) to the case of transfer (see Proposition 3.6). This is illustrated in Fig 1 and explained in detail in Section 3. As a consequence, unlike in usual classification, we can approximate the universally best classifier h T,α under the target arbitrarily well asymptotically, i.e., with sufficiently large data from a seemingly unrelated source. However, the story turns out more nuanced in finite sample regimes, i.e, as we consider the rate of such approximation (in terms of relevant sample sizes), even when µ1,S and µ1,T admits the same optimal classifiers. That is, two different sources µ1,S and µ 1,S may yield considerably different transfer rates in finite-sample regimes even if both of them share the same optimal classifier as the target µ1,T : this is because a given source may yield more data near the common decision boundary h T,α than another source, thus revealing h T,α at a faster rate. In particular, we show in our first main result of Theorem 4.5 a minimax lower-bound that the rate of convergence of any outlier-transfer approach is in fact controlled by a relatively simple notion of outlier-transferexponent (adapted from transfer results in traditional classification) which essentially measures how well a source may reveal the unknown decision boundary. Theorem 4.5 is in fact rather general: the minimax lower-bound holds for any hypothesis space H of finite VC dimension (at least 3), and any number of sample sizes from µ0,µ1,S,µ1,T , including the case of no sample from the rare target class µ1,T . Moreover, the result holds generally h S,α and h T,α are the same or not. We finally turn our attention to whether such rates may be achieved adaptively, i.e., from samples alone without prior knowledge of the discrepancy between µ1,S and µ1,T as captured by both the transfer-exponent and the amount of difference between optimal classifiers h S,α and h T,α. We show in Theorem 4.6 that this is indeed the case: the minimax lower-bounds of Theorem 4.5 can be matched up to logarithmic factors by some careful adaptation approach that essentially fine-tunes classifiers learned using source data with whatever target data is available. This is described in Section 4.8. 1.1 RELATED WORK Outlier detection and transfer learning have mostly been studied separately, despite the clear need for transfer learning in applications of outlier detection where the rare class of data is by definition, always scarce. As such, transfer learning works have mostly focused on traditional classification and regression starting from seminal works of Mansour et al. (2009); David et al. (2010); Ben-David et al. (2010; 2006), to more recent works of Hanneke & Kpotufe (2019; 2022); Kalan et al. (2022); Mousavi Kalan et al. (2020); Lei et al. (2021). The works of Hanneke & Kpotufe (2019; 2022) are most closely related as our notion of outlier-transfer-exponent may be viewed as an extention Published as a conference paper at ICLR 2024 of their notion of transfer-exponent; however, besides for the fact that both notions capture discrepancies around decision boundaries, transfer in outlier detection is fundamentally different from the case of traditional classification studied in the above works: for instance, as stated earlier, distributions that are significantly different in traditional classification can be quite close in outlier transfer as revealed in this work. Theoretical works on outlier detection on the other hand have mostly focused on unsupervised and supervised settings, but without considering the more practical transfer setting. Unsupervised outlier-detection assumes that only data from the common class µ0 is available; theoretical works include studies of density-level set estimation (Steinwart et al., 2005; Polonik, 1995; Ben-David & Lindenbaum, 1997; Tsybakov, 1997) where outliers are viewed as data in low density regions, or in works on so-called one-class classification that aim to learn a contour of the common class µ0 (Sch olkopf et al., 2001). Supervised outlier-detection has commonly been formalized via Neyman Pearson classification, where some data from both the common and rare classes are used to optimize and constrain empirical errors. Early works include (Cannon et al., 2002; Scott & Nowak, 2005; Blanchard et al., 2010; Rigollet & Tong, 2011) which establish convergence rates in various distributional and model selection settings, but all exclude the question of transfer. Transfer learning for outlier detection has in fact received much recent attention in the methodological literature (Xiao et al., 2015; Andrews et al., 2016; Id e et al., 2017; Chalapathy et al., 2018; Yang et al., 2020) where various approaches have been proposed that aim to leverage shared structural aspects of source and target rare class data. On the theoretical front however, much less is understood about outlier transfer. The recent work of Scott (2019) initiates theoretical understanding of the problem: they are first to show that, in some situations where both source and target share the same optimal classifiers, various procedures can guarantee consistency (i.e., taking sample size to infinity) even as source and target µ1,S,µ1,T appear different. Our Proposition 3.6 shows that in fact optimal classifiers may be shared in even more general situations, similarly implying consistency for seemingly very different source and target rare-class distributions. Our main results of Theorems 4.5 and 4.6 reach further in deriving the first insights into the finite-sample regimes of outlier-transfer, by establishing information-theoretic limits of the problem, along with notions of discrepancies between source and target that tightly capture such limits. We first formalize the Neyman-Pearson classification framework, followed by its extension to the transfer case. 2.1 NEYMAN-PEARSON CLASSIFICATION Let µ0 and µ1 denote probability distributions on some measurable space (X,Σ). Furthermore, suppose that H is a hypothesis class consisting of measurable 0-1 functions on the domain X, where we view h(x) = 0 or 1 as predicting that x is generated from class µ0 or µ1. We view µ0 and µ1 as representing a common and rare class of (future) data. Definition 2.1. We are interested in the so-called Type-I and Type-II errors defined as follows: Rµ0(h) = µ0(h(x) = 1), Rµ1(h) = µ1(h(x) = 0). Neyman-Pearson classification then refers to the problem of minimizing Type-II error subject to low Type-I error: Minimize h H Rµ1(h) s.t. Rµ0(h) α (2.1) Under mild conditions, the universally optimal classifier, i.e., taking H as the set of all measurable 0-1 functions, is fully characterized by the classical Neyman-Pearson Lemma (see Appendix A) in terms of density ratios. Namely, let p0 and p1 denote densities of µ0 and µ1 w.r.t. some dominating measure ν, then the minimizer of (2.1) has the form h α(x) = 1{ p1(x) p0(x) λ} whenever there exists λ such that Rµ0(h α) is exactly α1. 1If we further allow for randomized classifiers, then Neyman Pearson Lemma fully characterizes universally optimal solutions of (2.1) and establishes uniqueness almost-surely under mild restrictions. Published as a conference paper at ICLR 2024 In the Section 3 we ask when such universal minimizer transfers across source and target rare-class distributions. 2.2 TRANSFER LEARNING SETUP Population Setup. We consider the following two source and target Neyman-Pearson problems, defined for a fixed common class distribution µ0, and source and target rare-class distributions µ1,S and µ1,T : Minimize h H Rµ1,S(h) s.t. Rµ0(h) α (2.2) Minimize h H Rµ1,T (h) s.t. Rµ0(h) α (2.3) We let (µ0,µ1,S,α) and (µ0,µ1,T ,α) denote these source and target problems. We will see later that the limits of outlier-transfer, especially in finite-sample regimes, are well captured by discrepancies between these two problems. In particular, we will be interested in discrepancies between optimal solutions and the measure of the corresponding decision boundaries under involved distributions. We henceforth let h S,α and h T,α denote (not necessarily unique) solutions of (2.2) and (2.3) (which we assume exist). Finite-Sample Setup. We assume access to n0,n S,n T i.i.d. data points respectively from µ0,µ1,S,µ1,T , where we allow n T = 0. The transfer-learning procedure is then allowed to return ˆh H satisfying Rµ0(ˆh) α + ϵ0, for some slack ϵ0 = ϵ0(n0), usually of order n 1/2 0 . The goal of the learner is to minimize the target-excess error E1,T (ˆh) max{0,Rµ1,T (ˆh) Rµ1,T (h T,α)}. A main aim of this work is to understand which rates of E1,T (ˆh) are achievable in terms of sample sizes n S and n T , and which notions of discrepancy from source to target helps capture these limits. 3 EQUIVALENCE OF POPULATION PROBLEMS As discussed in the introduction, we may have seemingly very different source and target distributions µ1,S and µ1,T which however yield the same (universally) optimal classifiers. We investigate this phenomena in this section, the main aim being to illustrate how fundamentally different outlier transfer is from traditional classification. To this end we consider the set U of all possible measurable 0-1 functions on X, and let H = U in the dicussion below. We will henceforth say that the source problem (µ0,µ1,S,α) is equivalent to the target problem (µ0,µ1,T ,α) (at the population level), if all solutions to (2.2) are also solutions to (2.3). Clearly, when this is the case, source samples alone can drive down the target risk at least asymptotically. We first notice that Neyman-Pearson Lemma offers some immediate answers under mild conditions. To see this, let p0,p1,S,p1,T denote densities w.r.t. a common dominating measure ν. In what follows we will simply let the dominating measure ν µ0 + µ1,S + µ1,T . As previously discussed, Neyman-Pearson Lemma characterizes optimal classifiers in terms of level sets of density ratios. Definition 3.1 (Level sets). LS λ = {x p1,S(x) p0(x) λ} and LT λ = {x p1,T (x) p0(x) λ}. Here, when the denominator of a fraction is zero we view it as infinity no matter if the nominator is zero or nonzero. The following then establishes equivalence between source and target under mild conditions as a direct consequence of Neyman-Pearson Lemma. Proposition 3.2. Suppose µ0,µ1,S,µ1,T are mutually dominating. Assume further that µ0(LS λ) is continuous and strictly monotonic in (0,1) as a function of λ. Then, if {LS λ}λ 0 {LT λ }λ 0, we have for all 0 < α < 1, that any h S,α is a solution of the target problem (2.3). The above statement was already evident in the work of Scott (2019) where it is assumed that source density ratios are given as a monotonic function of the target density ratios; this immediately implies {LS λ} {LT λ }. Published as a conference paper at ICLR 2024 Figure 2: µ0, µ1,S, µ1,T = N(a0, σ2), N(a1,S, σ2), N(a1,T , σ 2) where a1,S = a1,T and σ < σ. Optimal decision rules differ. The statement is illustrated in the examples of Fig 1 with Gaussian distributions where the source may appear significantly different from the target (and in fact would yield different Bayes classifiers in traditional classification). To contrast, consider the example Fig 2 where the source problem yields different level sets than for the target problem (simply by changing the Gaussian variance), and we hence do not have equivalence. Remark 3.1 (Issue with Disjoint Supports). We have assumed in the Proposition 3.2 that all 3 distributions are mutually dominating, while in practice this might rarely be the case. However, without this assumption (essentially of shared support), we may not easily have equivalence between source and target. For intuition, consider a region A of space where µ1,S(A) = 0 while µ1,T (A) > 0. Then let h ,0 S,α = 0 on A and h ,1 S,α = 1 both optimal under the source problem; clearly we may have R1,T (h ,0 S,α) > R1,T (h ,1 S,α) since µ1,T (A) > 0. The rest of this section is devoted to handling this issue by restricting attention to more reasonable classifiers that essentially classify any point outside the support of µ0 as 1. The right notion of support is critical in establishing our main relaxation of the above proposition, namely Proposition 3.6 below. We require the following definitions and assumptions. Definition 3.3 (Restricted Universal Hyposthesis Class). We let U consist of all 0-1 measurable functions on the domain X such that for every h U , h 1 on {x p0(x) = 0} a.s. ν. Definition 3.4. We say that α is achievable if there exist thresholds λ and λ such that µ0(LS λ) = α and µ0(LT λ ) = α. Definition 3.5 (α-level set). Whenever α is achievable, we define LS(α) as the level set in the source whose measure under µ0 is α. The definition of LT (α) which corresponds to the target is the same. Remark 3.2. Definition 3.4 ensures that LS(α) exists, but it may not be unique. However, we will show in Appendix A by Proposition A.2, it is unique a.s. ν. The following proposition relaxes the conditions of Proposition 3.2 above by restricting attention to universal classifiers in U . Its proof is technical to various corner cases described in the Appendix. Proposition 3.6. Let 0 α < 1 and suppose that α is achievable. Then (µ0,µ1,S,α) is equivalent to (µ0,µ1,T ,α) under U iff LS(α) {LT λ }λ 0 a.s. ν. In particular, if α is achievable for all 0 α < 1 and LS(α) {LT λ }λ 0 a.s. ν for all 0 α < 1, then (µ0,µ1,S,α) is equivalent to (µ0,µ1,T ,α) for all 0 α < 1. Remark 3.3. Notice that the statements of Proposition 3.6 trivially also hold over any hypothesis class H (rather than just U ) containing the level sets 1LS(α),1LT (α) H and where, for every h H, h 1 on {x p0(x) = 0} a.s. ν. We illustrate this final Proposition in Figure 3: the simple restriction to U reveals more scenarios where source is equivalent to target (at the population level) even when the distributions appear significantly different. 4 FINITE SAMPLE RESULTS Neyman-Pearson Lemma offers the solution(s) of the optimization problem (2.3) when we have the knowledge of the underlying distributions. However, in practical scenarios, we typically lack infor- Published as a conference paper at ICLR 2024 (a) µ0, µ1,S, µ1,T (b) Source and target solutions. (c) h S,α is an optimal solution for source but not for target. Figure 3: Illustration of Example A.4 (see Appendix A.3), where h S,α = h T,α in U , while h S,α U is also optimal for source but not for target. In other words, the source problem remains equivalent to the target over the more reasonable decision rules of U . mation about these distributions and only possess some samples drawn from them. In addressing this challenge, Cannon et al. (2002) embarked on an empirical study of Neyman-Pearson classification and introduced a relaxed version of Neyman-Pearson classification problem. Let n0 and n T be the number of i.i.d. samples generated by µ0 and µ1,T , respectively, and ϵ0 > 0. Cannon et al. (2002) proposed the following optimization problem: ˆh =arg min h H ˆRµ1,T (h) s.t. ˆRµ0(h) α + ϵ0 where ˆRµ0(h) = 1 n0 Xi µ0 1{h(Xi) 0} and ˆRµ1,T (h) = 1 n T Xi µ1,T 1{h(Xi) 1}, and then derived the convergence rate of excess error in terms of the number of samples and VC dimension of the hypothesis class. Neyman-Pearson classification in the setting of transfer finite-sample scenarios remains unexplored. In this section, Our objective is to understand the fundamental limits of transfer outlier detection in the finite-sample regime, where there are n0,n S,n T i.i.d. samples from µ0,µ1,S,µ1,T , under a measure of discrepancy between source and target. We first define a discrepancy measure in transfer outlier detection and then characterize the fundamental limits of the problem by deriving a minimax lower bound in terms of the number of samples as well as the notion of discrepancy. Finally, we show that this lower bound is achievable through an adaptive procedure which does not require the prior knowledge of the discrepancy between source and target. 4.1 OUTLIER TRANSFER EXPONENT In this section, we define an appropriate notion of outlier transfer distance between source and target under a hypothesis class. Here we adapt the transfer exponent notion defined in Hanneke & Kpotufe (2019) to a notion of discrepancy between source and target in Neyman-Pearson classification with shared common distribution µ0. Definition 4.1 (Outlier transfer exponent). Let S α H denote the set of solutions of source (2.2). Furthermore, let (µ0,µ1,S,α) and (µ0,µ1,T ,α) denote the source and target problems, respectively. We call ρ(r) > 0 outlier transfer exponent from (µ0,µ1,S,α) to (µ0,µ1,T ,α) under H, if there exist r,Cρ(r) > 0 such that Cρ(r) max{0,Rµ1,S(h) Rµ1,S(h S,α)} max{0,Rµ1,T (h) Rµ1,T (h S,α)} for all h H with Rµ0(h) α + r, where h S,α = arg max h S α Rµ1,T (h). The following example shows that for a fixed target and for any arbitrary ρ 1, there exists a source such that it shares the same optimal decision rule as the target and attains the outlier transfer exponent ρ with coefficient Cρ = 1. Example 4.2. Let µ0 N( 1,1), µ1,T Unif[0,1], p1,S = ρxρ 11{x [0,1]} for ρ 1 where p1,S is the density of µ1,S w.r.t. Lebesgue measure. Furthermore, let α = µ0([0,1]) ,H = {1{t x 1}(x) t Published as a conference paper at ICLR 2024 [ 1,1]}, and r be small enough. Then, we have h T,α = h S,α = 1{0 x 1}. Moreover, for h = 1{t x 1} for t 0 we obtain Rµ1,T (h) Rµ1,T (h S,α) = t and Rµ1,S(h) Rµ1,S(h S,α) = tρ. Hence, the outlier transfer exponent is ρ and the coefficient Cρ is 1. Following proposition shows the effect of r on the outlier transfer exponent ρ(r). In some cases, for a small enough value of r, ρ(r) could be small, whereas for a large value of r, ρ(r) is infinite. Proposition 4.2. There exist µ0,µ1,S,µ1,T ,H,α,r such that for any h H with Rµ0(h) α + r, (4.2) holds for ρ(r) = 1 and there exists an h H with Rµ0(h) > α+r for which (4.2) does not hold for any ρ < . 4.3 MINIMAX LOWER BOUND FOR OUTLIER TRANSFER LEARNING Equipped with the notion of outlier transfer exponent, we characterize the fundamental limits of transfer outlier detection by establishing a minimax lower bound. To achieve this objective, first we need to specify the class of distributions for which the minimax lower bound is derived. Definition 4.3 (Class of distributions). Fix a hypothesis class H with finite VC dimension d H. Let FH(ρ,α,C, ) denote the class of triplets of distributions (µ0,µ1,S,µ1,T ) for which α is achievable according to Definition 3.4, E1,T (h S,α) , and there exist ρ(r) ρ,Cρ(r) C for any 0 < r < 2α d H according to Definition 4.1. Remark 4.4. Deriving a minimax lower bound for the class of distributions satisfying α achievability is a stronger result than for the class without necessarily satisfying that, as the former is contained in the latter. We also need to formalize the class of learners. Definition 4.4. Let 0 ϵ0 < 1, 0 < δ0 < 1. We call ˆh an (ϵ0,δ0)-approximate α-learner if it maps any three independent i.i.d. samples Sµ0 µn0 0 , Sµ1,S µn S 1,S, Sµ1,T µn T 1,T , to a function in Hα+ϵ0(µ0) = {h H Rµ0(h) α + ϵ0} with probability at least 1 δ0 w.r.t. Sµ0,Sµ1,S,Sµ1,T . Theorem 4.5 (Minimax lower bound). Fix a hypothesis class H with finite VC dimension d H 3. Moreover, let α < 1 2,ρ 1, 1, and δ0 > 0. Furthermore, suppose that there are n0,n S,n T i.i.d. samples from µ0,µ1,S,µ1,T , denoted by Sµ0,Sµ1,S,Sµ1,T . Assume that there are sufficiently many samples such that min{ + ( d H n S ) 1 2ρ ,( d H n T ) 1 2 } 2. Then, for any ( 2α d H ,δ0)-approximate α-learner ˆh, there exist (µ0,µ1,S,µ1,T ) FH(ρ,α,1, ) and universal constants c,c such that P Sµ0,Sµ1,S ,Sµ1,T (E1,T (ˆh) > c min{ + (d H n S ) 1 2ρ ,(d H n T ) 1 2 }) > c . (4.3) Remark 4.5. In Theorem 4.5, if (µ0,µ1,S,α) is equivalent to (µ0,µ1,T ,α) under H (see Remark 3.3), then E1,T (h S,α) = 0 and the minimax lower bound reduces to c min{( d H n S ) 1 2ρ ,( d H n T ) 1 2 }. In this case, by only having access to unlimited source samples, achieving an arbitrary small target-excess error is possible. Remark 4.6. The outlier transfer exponent in the term ( d H n S ) 1 2ρ captures the relative effectiveness of source samples in the target domain. If source and target share the same optimal decision rules, and ρ = 1, source samples would be equally effective as target samples. However, even if the source and target share the same optimal decision rules, source samples may result in poor transfer performance when ρ is large. Remark 4.7. In Theorem 4.5, the learner is allowed to output a classifier ˆh with a Type-I error that slightly exceeds the pre-defined threshold α. However, in certain applications, it is imperative to uphold the threshold without any exceeding. The minimax lower bound in Theorem 4.5, implies that (4.3) holds even if the learner is only allowed to output a classifier ˆh from {h H Rµ0(h) α} with probability 1 δ0, i.e., (0,δ0)-approximate α-learner. Published as a conference paper at ICLR 2024 4.8 ADAPTIVE RATE (UPPER BOUND) In Section 4.3, we establish a minimax lower bound that can be attained by an oracle that effectively disregards the least informative dataset from either the source or target. Let ˆHα+ϵ0(µ0) = {h H ˆRµ0(h) α + ϵ0}, ˆh T = arg min h ˆ Hα+ϵ0/2(µ0) ˆRµ1,T (h), and ˆh S = arg min h ˆ Hα+ϵ0/2(µ0) ˆRµ1,S(h). Then we get ˆh S, ˆh T Hα+ϵ0(µ0), and E1,T (ˆh S) E1,T (h S,α) + (d H n S ) 1 2ρ and E1,T (ˆh T ) (d H with high probability. However, Deciding whether ˆh S or ˆh T achieves a smaller error requires the knowledge of outlier transfer exponent and the target-excess error of the optimal source decision rule, which are not available in practical scenarios. In this section, we show that by using an adaptive procedure that takes source and target samples as input and produces a hypothesis ˆh H without using any additional information such as prior knowledge of the outlier transfer exponent, the minimax lower bound 4.3 is achievable. To accomplish this, we adapt the procedure introduced in Hanneke & Kpotufe (2019). Let δ > 0 and define 128d H log n + log(8/δ) Consider the following procedure: Define ˆh = ˆh S if ˆRµ1,T (ˆh S) ˆRµ1,T (ˆh T ) An T , otherwise, define ˆh = ˆh T (4.4) Theorem 4.6. Let H be a hypothesis class with finite VC dimension d H 3. Furthermore, let (µ0,µ1,S,α) and (µ0,µ1,T ,α) denote a source and a target problem. Suppose that there are n0,n S,n T i.i.d. samples from µ0,µ1,S,µ1,T , respectively. Let δ0,δ > 0, ϵ0 = 128 d H log n0+log(8/δ0) n0 . Moreover, suppose that there exist r ϵ0, Cρ(r), and ρ(r) according to Definition 4.1. Let ˆh be the hypothesis returned by Procedure (4.4). Then, with probability at least 1 δ0, ˆh Hα+ϵ0(µ0). Furthermore, with probability at least 1 δ0 2δ we have E1,T (ˆh) min{E1,T (h S,α) + C A 1 ρ(r) n S ,C An T } (4.5) where C (0, ) is a constant depending on r,Cρ(r),ρ(r). Proof. Consider the intersection of the following events which happens with probability at least 1 δ0 2δ (Devroye et al., 2013): 1) {sup h H Rµ0(h) ˆRµ0(h) ϵ0 2 }, 2) {sup h H Rµ1,S(h) ˆRµ1,S(h) 2 }, and 3) {sup h H Rµ1,T (h) ˆRµ1,T (h) An T 2 }. Since h S,α ˆHα+ϵ0/2(µ0), we get ˆRµ1,S(ˆh S) ˆRµ1,S(h S,α). Therefore, we obtain Rµ1,S(ˆh S) Rµ1,S(h S,α) [Rµ1,S(ˆh S) ˆRµ1,S(ˆh S)] + [ ˆRµ1,S(h S,α) Rµ1,S(h S,α)] An S. Furthermore, we have Rµ0(ˆh S) α + r, which implies that Rµ1,T (ˆh S) Rµ1,T (h S,α) c A 1 ρ(r) n S . Rµ1,T (ˆh S) Rµ1,T (h T,α) = Rµ1T (ˆh S) Rµ1,T (h S,α) + Rµ1,T (h S,α) Rµ1,T (h T,α) E1,T (h S,α) + c A 1 ρ(r) n S . (4.6) Published as a conference paper at ICLR 2024 Now if Rµ1,T (ˆh S) Rµ1,T (ˆh T ), we get ˆRµ1,T (ˆh S) ˆRµ1,T (ˆh T ) ˆRµ1,T (ˆh S) Rµ1,T (ˆh S) + Rµ1,T (ˆh T ) ˆRµ1,T (ˆh T ) An T which shows that the constraint in Procedure (4.4) holds, implying that ˆh = ˆh S and the upper bound (4.6) holds for Rµ1,T (ˆh) Rµ1,T (h T,α). On the other hand, if Rµ1,T (ˆh S) > Rµ1,T (ˆh T ), then Rµ1,T (ˆh T ) Rµ1,T (h T,α) < Rµ1,T (ˆh S) Rµ1,T (h T,α). So, regardless of whether ˆh = ˆh S or ˆh = ˆh T , the upper bound (4.6) holds for Rµ1,T (ˆh) Rµ1,T (h T,α). Moreover, Since ˆh satisfies the constraint in Procedure (4.4), we get Rµ1,T (ˆh) Rµ1,T (h T,α) = Rµ1,T (ˆh) Rµ1,T (ˆh T ) + Rµ1,T (ˆh T ) Rµ1,T (h T,α) 3ϵT which completes the proof. 5 OVERVIEW OF PROOF OF THEOREM 4.5 (MINIMAX LOWER BOUND) We follow Tysbakov s method (Tsybakov, 2009) for derivng the minimax lower bound. Theorem 5.1. Tsybakov (2009) Assume that M 2 and the function dist( , ) is a semi-metric. Furthermore, suppose that {Πθj}θj Θ is a family of distributions indexed over a parameter space, Θ, and Θ contains elements θ0,θ1,...,θM such that: (i) dist(θi,θj) 2s > 0, 0 i < j M (ii) Πj Π0, j = 1,...,M, and 1 M M j=1 Dkl(Πj Π0) γ log M with 0 < γ < 1/8 and Πj = Πθj, j = 0,1,...,M and Dkl denotes the KL-divergence. Then, we have inf ˆθ sup θ Θ Πθ(dist(ˆθ,θ) s) 2γ log M ). We also utilize the following proposition for constructing a packing of the parameter space. Proposition 5.2. (Gilbert-Varshamov bound) Let d 8. Then there exists a subset {σ0,...,σM} of { 1,+1}d such that σ0 = (1,1,...,1), dist(σj,σk) d 8, 0 j < k M and M 2d/8, where dist(σ,σ ) = card(i [m] σ(i) σ (i)) is the Hamming distance. First note that min{ + (d H n S ) 1 2ρ ,(d H n T ) 1 2 } 2 min{max{ ,(d H n S ) 1 2ρ },(d H n T ) 1 2 } = 2 max{min{(d H n S ) 1 2ρ ,(d H n T ) 1 2 },min{ ,(d H n T ) 1 2 }} So it suffices to show that the minimax lower bound is larger than both min{( d H n S ) 1 2ρ ,( d H n T ) 1 2 } and min{ ,( d H n T ) 1 2 }. We divide the proof into three parts: Minimax lower bound is larger than min{( d H n S ) 1 2ρ ,( d H n T ) 1 2 } for d H 17 (see Section C.1). Minimax lower bound is larger than min{( d H n S ) 1 2ρ ,( d H n T ) 1 2 } for 16 d H 3 (see Section C.2). Minimax lower bound is larger than min{ ,( d H n T ) 1 2 } (see Section C.3). In each part, Following Theorem 5.1, we construct a family of pairs of source and target distributions that belong to the class FH. To accomplish this, we pick some points from the domain X shattered by the hypothesis class H and then define appropriate distributions on these points. Additionally, this family of distributions is indexed by { 1,+1}d H, which can be treated as a metric space using the Hamming distance. To meet the requirement of condition (i) in Theorem 5.1, it is necessary for these indices to be well-separated, a condition that can be satisfied through utilizing Proposition 5.2. Published as a conference paper at ICLR 2024 Sulaiman Aburakhia, Tareq Tayeh, Ryan Myers, and Abdallah Shami. A transfer learning framework for anomaly detection using model of normality. In 2020 11th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0055 0061. IEEE, 2020. Jerone Andrews, Thomas Tanay, Edward J Morton, and Lewis D Griffin. Transfer representationlearning for anomaly detection. JMLR, 2016. Shai Ben-David and Michael Lindenbaum. Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55(1):171 182, 1997. Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19, 2006. Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151 175, 2010. Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection. The Journal of Machine Learning Research, 11:2973 3009, 2010. Katherine Bourzac. Diagnosis: early warning system. Nature, 513(7517):S4 S6, 2014. Adam Cannon, James Howse, Don Hush, and Clint Scovel. Learning with the neyman-pearson and min-max criteria. 2002. Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. Anomaly detection using one-class neural networks. ar Xiv preprint ar Xiv:1802.06360, 2018. Shai Ben David, Tyler Lu, Teresa Luu, and D avid P al. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 129 136. JMLR Workshop and Conference Proceedings, 2010. Luc Devroye, L aszl o Gy orfi, and G abor Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013. Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. In Advances in Neural Information Processing Systems, pp. 9867 9877, 2019. Steve Hanneke and Samory Kpotufe. A no-free-lunch theorem for multitask learning. The Annals of Statistics, 50(6):3119 3143, 2022. Tsuyoshi Id e, Dzung T Phan, and Jayant Kalagnanam. Multi-task multi-modal models for collective anomaly detection. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 177 186. IEEE, 2017. Shijoe Jose, D Malathi, Bharath Reddy, and Dorathi Jayaseeli. A survey on anomaly based host intrusion detection system. In Journal of Physics: Conference Series, volume 1000, pp. 012049. IOP Publishing, 2018. Seyed Mohammadreza Mousavi Kalan, Mahdi Soltanolkotabi, and A Salman Avestimehr. Statistical minimax lower bounds for transfer learning in linear binary classification. In 2022 IEEE International Symposium on Information Theory (ISIT), pp. 282 287. IEEE, 2022. Ayush Kumar and Teng Joon Lim. Edima: Early detection of iot malware network activity using machine learning techniques. In 2019 IEEE 5th World Forum on Internet of Things (WF-Io T), pp. 289 294. IEEE, 2019. Erich Leo Lehmann and EL Lehmann. Testing statistical hypotheses, volume 2. Springer, 1986. Published as a conference paper at ICLR 2024 Qi Lei, Wei Hu, and Jason Lee. Near-optimal linear regression under distribution shift. In International Conference on Machine Learning, pp. 6164 6174. PMLR, 2021. N Malini and M Pushpa. Analysis on credit card fraud identification techniques based on knn and outlier detection. In 2017 third international conference on advances in electrical, electronics, information, communication and bio-informatics (AEEICB), pp. 255 258. IEEE, 2017. Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. ar Xiv preprint ar Xiv:0902.3430, 2009. Mohammadreza Mousavi Kalan, Zalan Fabian, Salman Avestimehr, and Mahdi Soltanolkotabi. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. Advances in Neural Information Processing Systems, 33:1959 1969, 2020. Wolfgang Polonik. Measuring mass concentrations and estimating density contour clusters-an excess mass approach. The annals of Statistics, pp. 855 881, 1995. Philippe Rigollet and Xin Tong. Neyman-pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 2011. Bernhard Sch olkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443 1471, 2001. Clayton Scott. A generalized neyman-pearson criterion for optimal domain adaptation. In Algorithmic Learning Theory, pp. 738 761. PMLR, 2019. Clayton Scott and Robert Nowak. A neyman-pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11):3806 3819, 2005. Ingo Steinwart, Don Hush, and Clint Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6(2), 2005. Yueyang Su, Di Yao, and Jingping Bi. Transfer learning for region-wide trajectory outlier detection. IEEE Access, 2023. Alexandre B Tsybakov. On nonparametric estimation of density level sets. The Annals of Statistics, 25(3):948 969, 1997. Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer series in statistics. Springer, Dordrecht, 2009. doi: 10.1007/b13794. Tailai Wen and Roy Keyes. Time series anomaly detection using convolutional neural networks and transfer learning. ar Xiv preprint ar Xiv:1905.13628, 2019. Yanshan Xiao, Bo Liu, Philip S Yu, and Zhifeng Hao. A robust one-class transfer learning method with uncertain data. Knowledge and Information Systems, 44(2):407 438, 2015. Ziyi Yang, Iman Soltani Bozchalooi, and Eric Darve. Anomaly detection with domain adaptation. ar Xiv preprint ar Xiv:2006.03689, 2020. Yefeng Zheng, Maciej Loziczonek, Bogdan Georgescu, S Kevin Zhou, Fernando Vega-Higuera, and Dorin Comaniciu. Machine learning based vesselness measurement for coronary artery segmentation in cardiac ct volumes. In Medical Imaging 2011: Image Processing, volume 7962, pp. 489 500. Spie, 2011. Published as a conference paper at ICLR 2024 A APPENDIX A (EQUIVALENCE OF POPULATION PROBLEMS) We begin by stating Neyman-Pearson lemma (Lehmann & Lehmann, 1986) for deterministic tests. In the following, a classifier h X {0,1} aims at classifying H0 µ0 against the alternative H1 µ1. In the context of hypothesis testing, h is called a deterministic test. Moreover, Rµ0(h) and 1 Rµ1(h) are called size and power, respectively. Theorem A.1 (Neyman-Pearson Lemma (Lehmann & Lehmann, 1986)). Let µ0 and µ1 be probability distributions possessing densities p0 and p1 respectively with respect to a dominating measure ν. (i) Sufficient condition for an optimal solution of the optimization problem (2.1). Let h be a classifier for H0 µ0 against the alternative H1 µ1 such that for a constant λ the followings hold Rµ0(h) = α (A.1) and h(x) = {1 when p1(x) λp0(x) 0 when p1(x) < λp0(x) (A.2) Then h is an optimal solution of (2.1). (ii) Necessary condition for an optimal solution of the optimization problem (2.1). Suppose that there exist a classifier h and a constant λ such that (A.1) and (A.2) hold. Then, any solution of (2.1), denoted by h , satisfies the following a.s. ν h (x) = {1 when p1(x) > λp0(x) 0 when p1(x) < λp0(x) (A.3) h also satisfies Rµ0(h ) = α unless there exists a classifier h with Rµ0(h ) < α and Rµ1(h ) = 0. Remark A.1. To obtain a solution of the optimization problem (2.2 ) in the source (or (2.3) in the target) it suffices to take the level set LS λ (LT λ in the target) whose measure under µ0 is α, and then define a classifier h as h = 1LS λ in the source (h = 1LT λ in the target). Obviously, the classifier h satisfies (A.1) and (A.2), and therefore it is an optimal solution. Proposition A.2. Suppose that there exist a classifier h with h 1 on {x p0(x) = 0} a.s. ν and a constant λ such that (A.1) and (A.2) hold for h. Then any solution of (2.1), denoted by h , such that h 1 on {x p0(x) = 0} a.s. ν and Rµ0(h ) = α satisfies (A.2) a.s. ν. Proof. Let S1 = {x h (x) = h(x)}, S2 = {x h (x) h(x),p1(x) λp0(x)}, S3 = {x h (x) h(x),p1(x) = λp0(x)}. By Neyman-Pearson Lemma part (ii), we know that ν(S2) = 0. Moreover, we have = S1 h p0dν + S2 h p0dν + S3 h p0dν Since ν(S2) = 0, we conclude that S2 h p0dν = 0. Furthermore, on S3 we have h 1 and h h . So h 0 on S3 and S3 h p0dν = 0. Hence, α = S1 h p0dν = S1 hp0dν On the other hand, we have α = hp0dν = S1 hp0dν + S2 hp0dν + S3 hp0dν = α + S3 hp0dν Therefore, S3 hp0dν = S3 p0dν = 0, because h 1 on S3. We claim that ν(S3) = 0. By contradiction assume that ν(S3) > 0. First we show that p0 is positive on S3 a.s. ν. The reason is that on S3 we have h 1 and h h . In addition, for x satisfying p0(x) = 0, we have h(x) = h (x) = 1 a.s. ν. Therefore, p0 must be positive on S3 a.s. ν. However, we have S3 p0dν = 0 which cannot be true since ν(S3) > 0 and p0 > 0 a.s. ν on S3. Hence, we conclude that ν(S3) = 0. Finally, we obtain that ν(S2 S3) = 0, where S2 S3 = {x h (x) h(x)}. Published as a conference paper at ICLR 2024 Remark A.2. Proposition A.2 implies that LS(α) in Definition 3.5 is unique a.s. ν. Now we are ready to prove Proposition 3.6 in Section 3, which characterizes equivalent source and target. First we show the following technical lemma. Lemma A.3. Suppose that α is achievable. If α < 1, then (2.2) cannot have any other solution h with Rµ0(h) < α and Rµ1,S(h) = 0. Proof of Lemma A.3. By contradiction, suppose that there exists an optimal solution h1 of (2.2) with Rµ0(h1) < α and Rµ1,S(h1) = 0. Furthermore, let h = 1LS(α). Since both h and h1 are optimal solutions of (2.2), we have Rµ1,S(h) = Rµ1,S(h1) = 0. By Neyman-Pearson Lemma part (ii), we know that h1 = h on {x p0(x) λp1,S(x)} a.s. ν. Let us define the sets S1 = {x h1(x) h(x)} and S2 = {x p0(x) = λp1,S(x)}. Then we have S1 S2 a.s. ν. Since h 1 on S2, we must have h1 0 on S1 a.s. ν. From Rµ1,S(h) = Rµ1,S(h1) = 0 we conclude that µ1,S(S1) = 0 and from Rµ0(h1) < Rµ0(h) = α we conclude that µ0(S1) > 0. Let S 1 = {x S1 p0(x) > 0}. Hence µ0(S 1) > 0 and ν(S 1) > 0. Then, let us define the set A = {x x S 1,p1,S(x) = 0}. We have µ1,S(S 1) = S 1 p1,Sdν = S 1/A p1,Sdν = 0. Since p1S > 0 on S1/A, we conclude that ν(S 1/A) = 0 which implies that ν(A) > 0. On A, p1,S 0 and p0 > 0 and h 1 a.s. ν. Hence, we should have λ = 0 which implies that µ0(LS(α)) = 1. However, we assumed that α < 1. Proof of Proposition 3.6. (Sufficiency) Suppose that LS(α) {LT λ }λ 0 a.s. ν. Due to Neyman Pearson Lemma part (i), 1LS(α) is an optimal solution of (2.2). Let h S U be any arbitrary optimal solution of (2.2). First we consider the case that Rµ1,S(h S) > 0 (or the power of h S in the source problem is less than 1). By Neyman-Pearson Lemma part (ii) we have Rµ0(h S) = α. Then by Proposition A.2, h S = 1LS(α) a.s. ν. We claim that h S is an optimal solution of (2.3). Since LS(α) {LT λ }λ 0, there exists LT λ such that LS(α) = LT λ a.s. ν. Hence, µ0(LT λ ) = α and 1LT λ is an optimal solution of (2.3). Furthermore, h S = 1LS(α) = 1LT λ a.s. ν which implies that h S is an optimal solution of (2.3). If Rµ1,S(h S) = 0 and Rµ0(h S) = α, it would be similar to the previous case. Furthermore, by Lemma A.3, we cannot have a solution h S with Rµ1,S(h S) = 0 and Rµ0(h S) < α. (Necessity) Suppose that any optimal solution of (2.2) is also an optimal solution of (2.3). Since 1LS(α) is an optimal solution of (2.2), we conclude that it is also an optimal solution of (2.3). Since Rµ0(1LS(α)) = α, by Proposition A.2 and α achievability, 1LT (α) = 1LS(α) a.s. ν. Therefore, LS(α) = LT (α) a.s. ν and LS(α) {LT λ }λ 0 a.s. ν. A.3 EXAMPLE CORRESPONDING TO FIG 3 Example A.4. Let ν be the Lebesgue measure, α = 1 16, and U be all the measurable 0-1 functions on R. Furthermore, let µ1,S Unif[1,2], µ1,T Unif[ 4 Then, we have LS(α) = LT (α) = ( , 2] [ 3 2,+ ). Consider the hypothesis h = 1{x [ 3 2 ,2]} which is a solution in the source but not in the target. However, source is equivalent to target under U . Published as a conference paper at ICLR 2024 B APPENDIX B (OUTLIER TRANSFER EXPONENT) B.1 PROOF OF PROPOSITION 4.2 Proof. Let µ0 N(0,1), µ1,S U[0,1], µ1,T A1 U[t1,2t0 1] + A2 U[2t0 1,1] where A2 > A1 > 0, t1 > 0, 1 2 < t0 < 1, 2t0 1 > t1 and A1(2t0 t1 1) + A2(2 2t0) = 1. Moreover, H = {1{x [a,1] [b,t0]}(x) t0 a 1,t1 b t0}. Let α = µ0([t0,1]) and r < µ0([2t0 1,t0]) α. Clearly by Neyman-Pearson Lemma the unique source and target optimal solutions are h S,α = h T,α = 1{t0 x 1}. Then for any h with Rµ0(h) α + r, h is of the form h = 1{x [a,1] [b,t0]} for some a [t0,1] and b [2t0 1,t0]. Hence, Rµ1,S(h) Rµ1,S(h S,α) = a + b 2t0 and Rµ1,T (h) Rµ1,T (h S,α) = A2(a + b 2t0) which implies that ρ(r) = 1. However, if we take h = 1{2t0 1 ϵ x t0}(x) for small enough 0 < ϵ < (1 t0)(A2 A1) A1 , which violates the condition Rµ0(h) α + r, (4.2) does not hold for any ρ < . C APPENDIX C: PROOF OF THEOREM 4.5 (LOWER BOUND) C.1 MINIMAX LOWER BOUND IS LARGER THAN min{( d H n S ) 1 2ρ ,( d H n T ) 1 2 } FOR d H 17 Let d = d H 1 and d H be odd (If d H is even then define d = d H 2). Then pick d H points S = {x0,x1,1,...,x1, d 2 ,x2,1,...,x2, d 2 } from X shattered by H (if d H is even then we pick d H 1 points). Moreover, let H be the projection of H onto the set S with the constraint that all h H classify x0 as 0. Next we construct a distribution µ0 and a family of pairs of distributions (µσ 1,S,µσ 1,T ) indexed by σ { 1,+1} d 2 . In the following, we fix ϵ = c1 min{( d H n S ) 1 2ρ ,( d H n T ) 1 2 } for a constant c1 < 1 to be determined. Distribution µ0: We define µ0 on S as follows: µ0(x1,i) = µ0(x2,i) = 2α d for i = 1,..., d and µ0(x0) = 1 2α. Distribution µσ 1,T : We define µσ 1,T on S as follows: µσ 1,T (x1,i) = 1 d + (σi/2) ϵ d for i = 1,..., d µσ 1,T (x2,i) = 1 d for i = 1,..., d and µσ 1T (x0) = 0. Distribution µσ 1,S: We define µσ 1,S on S as follows: µσ 1,S(x1,i) = 1 d + (σi/2) ϵρ d for i = 1,..., d µσ 1,S(x2,i) = 1 d (σi/2) ϵρ d for i = 1,..., d and µσ 1S(x0) = 0. Verifying the transfer distance condition. For any σ { 1,+1} d 2 , let hσ H be the minimizer of Rµσ 1,S and Rµσ 1,T with type-I error w.r.t. µ0 at most α. Then hσ satisfies the following: hσ(x1,i) = 1 hσ(x2,i) = {1 if σi = 1 0 otherwise for i = 1,..., d Published as a conference paper at ICLR 2024 For any ˆh H with α 2α d < µ0(ˆh) < α + 2α d , we have µ1,T (hσ = 1) µ1,T (ˆh = 1) = k µ1,S(hσ = 1) µ1,S(ˆh = 1) = k for some non-negative integer k d 2. So the outlier transfer exponent is ρ with Cρ = 1. The condition is also satisfied for ˆh H with µ0(ˆh) α 2α d . In this case we have µ1T (hσ = 1) µ1T (ˆh = 1) = k1 µ1S(hσ = 1) µ1S(ˆh = 1) = k1 for some integers k1 d 2 and k2 d. Using inequality (a + b)ρ 2ρ 1(aρ + bρ) the condition can be easily verified. Reduction to a packing. Any classifier ˆh S {0,1} can be reduced to a binary sequence in the domain { 1,+1}d. We can first map ˆh to (ˆh(x1,1), ˆh(x1,2),..., ˆh(x1, d 2 ), ˆh(x2,1),..., ˆh(x2, d 2 )) and then convert any element 0 to 1. We choose the Hamming distance as the distance required in Theorem 5.1. By applying Proposition 5.2 we can get a subset Σ of { 1,+1} d 2 with Σ = M 2d/16 such that the hamming distance of any two σ,σ Σ is at least d/16. Any σ,σ Σ can be mapped to binary sequences in the domain {+1, 1}d by replicating and negating, i.e., (σ, σ),(σ , σ ) {+1, 1}d and the hamming distance of resulting sequences in the domain {+1, 1}d is at least d/8. Then for any ˆh H with µ0(ˆh = 1) < α+ 2α d and σ Σ, if the hamming distance of the corresponding binary sequence of ˆh and σ in the domain {+1, 1}d is at least d/8 then we have µ1,T (hσ = 1) µ1,T (ˆh = 1) d In particular, for any σ,σ Σ we have µ1,T (hσ = 1) µ1,T (hσ = 1) d KL divergence bound. Define Πσ = (µσ 1,S)n S (µσ 1,T )n T . For any σ,σ Σ, our aim is to bound the KL divergence of Πσ,Πσ . We have Dkl(Πσ Πσ ) = n S Dkl(µσ 1,S µσ 1,S) + n T Dkl(µσ 1,T µσ 1,T ) The distribution µσ 1,S can be expressed as P σ X P σ Y X where P σ X is a uniform distribution over the set {1,2,..., d 2} and P σ Y X=i is a Bernoulli distribution with parameter 1 2 (σi/2) ϵρ. Hence we get Dkl(µσ 1,S µσ 1,S) = 1 d/2 Dkl(Ber(1 2 (σi/2) ϵρ) Ber(1 2 (σ i/2) ϵρ)) 4c0c2ρ 1 d H for some numerical constant c0. Using the same argument we can obtain Dkl(µσ 1,T µσ 1,T ) c0c2 1 d n T . Hence we get Dkl(Πσ Πσ ) 2c0c1d. Published as a conference paper at ICLR 2024 Then, for sufficiently small c1 we get Dkl(Πσ Πσ ) 1 8 log M which satisfies condition (ii) in Proposition 5.1. Therefore, for any learner that outputs a hypothesis ˆh from {h H µ0(h) α+ 2α d H } with probabil- ity 1 δ0, there exist (µ0,µ1,S,µ1,T ) FH(ρ,α,1, ) such that condition on ˆh {h H µ0(h) α + 2α d H } we have P Sµ0,Sµ1,S ,Sµ1,T (E1,T (ˆh) > c min{ + (d H n S ) 1 2ρ ,(d H n T ) 1 2 }) c which implies that the unconditional probability is as follows P Sµ0,Sµ1,S ,Sµ1,T (E1,T (ˆh) > c min{ + (d H n S ) 1 2ρ ,(d H n T ) 1 2 }) (1 δ0)c c C.2 MINIMAX LOWER BOUND IS LARGER THAN min{( d H n S ) 1 2ρ ,( d H n T ) 1 2 } FOR 16 d H 3 Pick three points S = {x0,x1,x2} from X shattered by H. Then we construct a distribution µ0 and two pairs of distributions (µk 1,S,µk 1,T ) for k = 1,1. Also fix ϵ = c1 min{( 1 n S ) 1 2ρ ,( 1 n T ) 1 2 } for a constant c1 < 1 to be determined. Distribution µ0: We define µ0 on S as follows: µ0(x0) = 1 2α, µ0(x1) = µ0(x2) = α Distribution µk 1,T : We define µk 1,T on S as follows: µk 1,T (x0) = 0, µk 1,T (x1) = 1 2 ϵ, µk 1,T (x2) = 1 Distribution µk 1,S: We define µk 1,S on S as follows: µk 1,S(x0) = 0, µk 1,S(x1) = 1 2 ϵρ, µk 1,S(x2) = 1 Let Πk = (µk 1,S)n S (µk 1,T )n T for k = 1,1. Then using the same argument we get Dkl(Π 1 Π1) c where c is a numerical constant. Furthermore, let hk be the optimal solution with type-I error at most α for the distributions (µ0,µk 1,S) and (µ0,µk 1,T ). It is easy to see that Rµk 1,T (h k) Rµk 1,T (hk) = ϵ. Using Le Cam s method we get that for any ˆh chosen from Hα = {h H µ0(h) α + 2α 3 } there exist (µ0,µ1,S,µ1,T ) FH(ρ,α,1,0) such that P Sµ0,Sµ1,S ,Sµ1,T (E1,T (ˆh) > c min{( 1 n S ) 1 2ρ ,( 1 n T ) 1 2 }) c Since d H 16 we conclude that P Sµ0,Sµ1,S ,Sµ1,T (E1,T (ˆh) > c min{(d H n S ) 1 2ρ ,(d H n T ) 1 2 }) c for some numerical constants c,c . C.3 MINIMAX LOWER BOUND IS LARGER THAN min{ ,( d H n T ) 1 2 } We only show it for the case where d H 17. The other case follows the same idea as in Section C.2. We follow the same idea as in the previous part. Let ϵ = c1 min{ ,( d H n T ) 1 2 } and pick the same set S from X shattered by H construct the distributions on S as follows: Published as a conference paper at ICLR 2024 Distribution µ0: We define µ0 on S as follows: µ0(x1,i) = µ0(x2,i) = 2α d for i = 1,..., d and µ0(x0) = 1 2α. Distribution µσ 1,T : We define µσ 1,T on S as follows: µσ 1,T (x1,i) = 1 d + (σi/2) ϵ d for i = 1,..., d µσ 1,T (x2,i) = 1 d for i = 1,..., d and µσ 1,T (x0) = 0. Distribution µσ 1,S: We define µσ 1,S on S as follows: µσ 1,S(x1,i) = 1 d + (1/2) ϵρ d for i = 1,..., d µσ 1,S(x2,i) = 1 d for i = 1,..., d and µσ 1,S(x0) = 0. Note that unlike previous part, all the distributions µσ 1S are the same for different σ s. Verifying E1,T (h S,α) . For every pair of (µσ 1,S,µσ 1,T ) we have E1,T (h S,α) d verifying the transfer distance condition and reducing to a packing parts follow the same idea. We just bound the corresponding k L-divergence. KL divergence bound. Define Πσ = (µσ 1,S)n S (µσ 1,T )n T . We have Dkl(Πσ Πσ ) = n S Dkl(µσ 1,S µσ 1S) + n T Dkl(µσ 1,T µσ 1,T ) Since source distributions are the same, the first term is zero. Following the same argument we get Dkl(µσ 1,T µσ 1,T ) c0ϵ2 c0c1 d n T where c0 is the same numerical constant used in (C.1). Then for sufficiently small c1 we get Dkl(Πσ Πσ ) 1 8 log M which satisfies condition (ii) in Proposition 5.1.