# Density Ratio Estimation with Doubly Strong Robustness

Ryosuke Nagumo¹,² and Hironori Fujisawa³,¹

¹The Graduate University for Advanced Studies (SOKENDAI), Tokyo, Japan. ²Panasonic Holdings Corporation, Osaka, Japan. ³Institute of Statistical Mathematics, Tokyo, Japan. Correspondence to: Ryosuke Nagumo.

*Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).*

**Abstract.** We develop two density ratio estimation (DRE) methods with robustness to outliers. These are based on divergences with a weight function that weakens the adverse effects of outliers. One is based on the unnormalized Kullback-Leibler divergence, called Weighted DRE, and its optimization is a convex problem. The other is based on the γ-divergence, called γ-DRE, which resolves a normalizing-term problem of Weighted DRE. Its optimization is a DC (Difference of Convex functions) problem and needs more computation than a convex problem. These methods have doubly strong robustness, which means robustness to heavy contamination of both the reference and target distributions. Numerical experiments show that our proposals are more robust than previous methods.

## 1. Introduction

Density ratio estimation (DRE) is a statistical method to directly estimate the ratio of two probability density functions without estimating each function (Nguyen et al., 2007; Sugiyama et al., 2012b). DRE is used in many applications, including change detection (Kawahara & Sugiyama, 2009; Liu et al., 2013), outlier detection (Hido et al., 2011), covariate shift adaptation (Shimodaira, 2000; Zhang et al., 2023), and two-sample testing (Wornowizki & Fried, 2016; Kim et al., 2021). Its parametric formulation is also called the differential graphical model (Liu et al., 2014; 2017a;b). Differential graphical models are used in protein and genetic interaction mapping (Ideker & Krogan, 2012) and brain imaging (Na et al., 2020).

Density ratio estimation is not robust when data exist in a region where the density function values are small (Smola et al., 2009; Yamada et al., 2011). Consider a density ratio $r(x) = p(x)/q(x)$, where we call $p(x)$ the reference distribution and $q(x)$ the target distribution. When data exist in a region where $p(x)$ or $q(x)$ is small, the density ratio $r(x)$ tends to be wrongly estimated. This situation occurs, for example, when $p(x)$ and $q(x)$ have no common support, which is the usual case in the high-dimensional setting (Rhodes et al., 2020; Kato & Teshima, 2021; Choi et al., 2021; 2022; Srivastava et al., 2023).

Our interest is in the case where outliers contaminate the main distributions (Maronna et al., 2006; Hampel et al., 2011). Because outliers exist in a region where the density functions of the main distributions are small, they have adverse effects on the estimation of the density ratio. For density estimation, many robust methods have been proposed, including the Huber loss (Huber, 1964), the trimming estimator (Hadi & Luceno, 1997), the density power divergence (Basu et al., 1998), and the γ-divergence (Fujisawa & Eguchi, 2008). These robust estimation methods have enabled real-world applications, including yeast gene expression (Yang & Lozano, 2015) and gene function regulation (Hirose et al., 2017). For outlier-robust density ratio estimation, only limited research has been done as far as we know (Sugiyama et al., 2012a).
The most promising method is Trimmed DRE (Liu et al., 2017c). This method assumes that outliers have larger density ratio values than inliers. It also assumes that only the reference dataset is contaminated while the target dataset remains clean. Under these assumptions, Trimmed DRE is shown to be robust experimentally and theoretically.

Although Trimmed DRE is shown to be robust, there are some limitations to its robustness. The density ratio values of outliers do not necessarily exceed those of inliers, which breaks the assumption this method relies on. Besides, Trimmed DRE is not robust to contamination of the target dataset, which restricts its usage in real-world applications. For example, in change point detection for time series data, a sequence is divided into subsequences that are assigned to the reference and target datasets sequentially (Liu et al., 2017a). Then, both the reference and target datasets can include outliers. Robustness to reference contamination alone is insufficient for time series applications.

To overcome these limitations, we develop density ratio estimation methods with doubly strong robustness, where "doubly" refers to the two types of contamination of the reference and target datasets, and "strong" means independence from the contamination ratio. We propose Weighted DRE and γ-DRE, which realize doubly strong robustness. Both methods employ a weight function that weakens the adverse effects of outliers. Weighted DRE minimizes the unnormalized Kullback-Leibler (UKL) divergence, and the minimization is a convex problem. γ-DRE minimizes the γ-divergence to overcome a drawback of Weighted DRE in estimating the normalizing term. Its minimization is a DC (Difference of Convex functions) problem and needs more computation than Weighted DRE. To our knowledge, these methods are the first to achieve doubly strong robustness.

This paper is organized as follows. Section 2 introduces the conventional DRE method from the viewpoint of the density ratio function and the discrepancy measure. In Section 3, we propose two robust DRE methods, Weighted DRE and γ-DRE, and theoretically show that they have doubly strong robustness. In Section 4, numerical experiments illustrate that the proposed methods are more robust than previous ones.

## 2. Density Ratio Estimation

The density ratio is defined as the ratio of two density functions. Let $p(x)$ and $q(x)$ be the strictly positive density functions of the reference and target datasets, respectively, for $x \in \mathbb{R}^d$. The true density ratio can be written as $r(x) = p(x)/q(x) : \mathbb{R}^d \to \mathbb{R}$. To estimate the density ratio, we employ a density ratio function $r_\beta(x)$, where $\beta$ is a parameter, and then measure the discrepancy between the true density ratio $r(x)$ and the density ratio function $r_\beta(x)$. The choice of the density ratio function and the discrepancy measure realizes the various DRE methods.

### 2.1. Density Ratio Function

The formulation of the density ratio function is categorized into three patterns: parametric models, non-parametric models, and deep models. For the parametric models (Liu et al., 2014; 2017b), the density ratio function is defined as

$$r_{\theta,C}(x) = C\, r_\theta(x) = C \exp\left(\theta^T h(x)\right), \tag{1}$$

where $\theta \in \mathbb{R}^p$ is the difference parameter, $C \in \mathbb{R}$ is the normalizing term, and $h(x) : \mathbb{R}^d \to \mathbb{R}^p$ is the feature transform function. A detailed discussion of this formulation is given in Appendix A. This parametric function is useful for sparse estimation in high-dimensional settings (Liu et al., 2017b).
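To make the parametric formulation concrete, the following is a minimal sketch of eq. (1) in Python with the quadratic feature transform $h(x) = (x_u x_v)_{u,v}$ used later for Gaussian and Ising models; the function names are our own, not from the paper.

```python
import numpy as np

def quadratic_features(x):
    # h(x) = (x_u x_v) for all pairs (u, v); for x in R^d this gives p = d^2 features
    return np.outer(x, x).ravel()

def parametric_density_ratio(x, theta, C):
    # r_{theta,C}(x) = C * exp(theta^T h(x)), cf. eq. (1)
    return C * np.exp(theta @ quadratic_features(x))

# Example: d = 2; theta encodes the off-diagonal difference theta_{1,2} = theta_{2,1}
theta = np.array([0.0, 0.2, 0.2, 0.0])
print(parametric_density_ratio(np.array([1.0, -1.0]), theta, C=1.0))
```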
For the non-parametric models, the density ratio function can be defined by the linear model (Sugiyama et al., 2008; Kanamori et al., 2009; 2012) as $r^{\mathrm{lm}}_\theta(x) = \theta^T \psi(x)$, where $\theta \in \mathbb{R}^b$ is a weight parameter and $\psi(x) : \mathbb{R}^d \to \mathbb{R}^b$ is a basis function. Another formulation is the log-linear model (Tsuboi et al., 2009; Kanamori et al., 2010), defined as $r^{\mathrm{llm}}_{\theta,C}(x) = C \exp(\theta^T \psi(x))$. The standard choice of the basis function is $\theta^T \psi(x) = \sum_{i=1}^{n_p} \theta_i K(x, x^{(p)}_i)$, where $K(x, x')$ is the Gaussian kernel and $x^{(p)}_i$ for $i = 1, \dots, n_p$ are the data points in the reference dataset. These non-parametric models are useful for complex distributions (Sugiyama et al., 2012a).

For the deep models, several density ratio methods using deep neural nets have been proposed (Rhodes et al., 2020; Kato & Teshima, 2021; Choi et al., 2021; 2022; Srivastava et al., 2023). These methods are useful for high-dimensional and unstructured data such as images.

### 2.2. Discrepancy Measure

Statistical divergences are reasonable choices to measure the discrepancy between the true density ratio $r(x)$ and the density ratio function $r_\beta(x)$. The most famous one is the Bregman (BR) divergence (Bregman, 1967; Sugiyama et al., 2012a; Kato & Teshima, 2021). Let $f$ be a differentiable and strictly convex function with derivative $\partial f$. Then, we quantify the discrepancy between $r(x)$ and $r_\beta(x)$ as

$$D_{\mathrm{BR}}(r, r_\beta) = \int \left[ f(r(x)) - f(r_\beta(x)) - \partial f(r_\beta(x))\,(r(x) - r_\beta(x)) \right] q(x)\,dx$$
$$= \int \left[ \partial f(r_\beta(x))\, r_\beta(x) - f(r_\beta(x)) \right] q(x)\,dx - \int \partial f(r_\beta(x))\, p(x)\,dx + \text{const.}$$

The Bregman divergence is a general expression, and the choice of $f$ realizes various methods. For example, the UKL (unnormalized Kullback-Leibler) divergence (Nguyen et al., 2007) and KLIEP (Kullback-Leibler Importance Estimation Procedure) (Sugiyama et al., 2008) adopt $f(t) = t \log t - t$, LSIF (Least-Squares Importance Fitting) (Kanamori et al., 2009) and KMM (Kernel Mean Matching) (Gretton et al., 2009) adopt $f(t) = (t-1)^2/2$, and the BKL (binary Kullback-Leibler) divergence (Hastie et al., 2001) adopts $f(t) = t \log t - (1+t)\log(1+t)$.

Given two datasets, $\{x^{(p)}_i\}_{i=1}^{n_p} \overset{\text{i.i.d.}}{\sim} p(x)$ for the reference and $\{x^{(q)}_i\}_{i=1}^{n_q} \overset{\text{i.i.d.}}{\sim} q(x)$ for the target, the Bregman divergence without the constant term is approximated by

$$\hat{D}_{\mathrm{BR}}(r, r_\beta) = \frac{1}{n_q} \sum_{i=1}^{n_q} \left\{ \partial f(r_\beta(x^{(q)}_i))\, r_\beta(x^{(q)}_i) - f(r_\beta(x^{(q)}_i)) \right\} - \frac{1}{n_p} \sum_{i=1}^{n_p} \partial f(r_\beta(x^{(p)}_i)).$$

## 3. Robust Density Ratio Estimation

Our idea for achieving robust estimation is to introduce a weight function $w(x) : \mathbb{R}^d \to \mathbb{R}_+$ that reduces the adverse effects of outliers on the estimator (Maronna et al., 2006). We propose Weighted DRE and γ-DRE to realize this idea.

### 3.1. Weighted DRE

#### 3.1.1. Formulation

The Bregman divergence can include the weight function in the base measure:

$$D_{\mathrm{BR}}(r, r_\beta; w) = \int \left[ \partial f(r_\beta(x))\, r_\beta(x) - f(r_\beta(x)) \right] q(x)\, w(x)\,dx - \int \partial f(r_\beta(x))\, p(x)\, w(x)\,dx + \text{const.}$$

With the base measure $w(x)\,dx$, the Bregman divergence still has the following properties: (i) $D_{\mathrm{BR}}(r, r_\beta; w) \ge 0$; (ii) $D_{\mathrm{BR}}(r, r_\beta; w) = 0 \Leftrightarrow r = r_\beta$. Some combinations of the density ratio function $r_\beta$ and the convex function $f$ can realize robust estimation. Here, we show an example of parametric DRE with the UKL divergence. When adopting the parametric density ratio function (1) and $f(t) = t \log t - t$, the UKL divergence can be written as

$$D_{\mathrm{UKL}}(r, r_{\theta,C}; w) = \int r_{\theta,C}(x)\, q(x)\, w(x)\,dx - \int \log r_{\theta,C}(x)\, p(x)\, w(x)\,dx$$
$$= C \int \exp(\theta^T h(x))\, q(x)\, w(x)\,dx - \int \left(\theta^T h(x) + \log C\right) p(x)\, w(x)\,dx + \text{const.} \tag{2}$$

Because (2) is convex in $C$, the optimal normalizing term is

$$C^* = \frac{\int w(x)\, p(x)\,dx}{\int \exp(\theta^T h(x))\, w(x)\, q(x)\,dx}. \tag{3}$$
Then, the parameter $\theta$ is estimated by minimizing

$$D_{\mathrm{UKL}}(r, r_\theta; w) = D_{\mathrm{UKL}}(r, r_{\theta,C^*}; w) = -\int \theta^T h(x)\, w(x)\, p(x)\,dx + \int w(x)\, p(x)\,dx \cdot \log \int \exp(\theta^T h(x))\, w(x)\, q(x)\,dx + \text{const.} \tag{4}$$

This UKL divergence is empirically approximated, without the constant term, by the two datasets $\{x^{(p)}_i\}_{i=1}^{n_p}$ and $\{x^{(q)}_i\}_{i=1}^{n_q}$:

$$\hat{D}_{\mathrm{UKL}}(r, r_\theta; w) = -\frac{1}{n_p} \sum_{i=1}^{n_p} \theta^T h(x^{(p)}_i)\, w(x^{(p)}_i) + \frac{1}{n_p} \sum_{i=1}^{n_p} w(x^{(p)}_i) \cdot \log \frac{1}{n_q} \sum_{i=1}^{n_q} \exp(\theta^T h(x^{(q)}_i))\, w(x^{(q)}_i). \tag{5}$$

This objective function is convex in $\theta$ and can be minimized via gradient descent. For sparse estimation, we can add a regularization term, for example, $\lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2$ with positive tuning parameters $\lambda_1$ and $\lambda_2$ (Zou & Hastie, 2005). A more efficient optimization method via the Lagrangian dual problem is discussed in Appendix B.1. We call this estimation procedure Weighted DRE; a minimal implementation sketch is given below.
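The following sketch minimizes the empirical objective (5) with `scipy`, assuming precomputed feature matrices and the weight $w(x) = \exp(-\|x\|_4^4/\tau)$ used in Example 3.2 and Section 4.1; the helper names are our own.

```python
import numpy as np
from scipy.optimize import minimize

def weight(X, tau=50.0):
    # w(x) = exp(-||x||_4^4 / tau); tau = 50 follows the choice in Section 4.1
    return np.exp(-np.sum(X**4, axis=1) / tau)

def weighted_ukl(theta, Hp, Hq, wp, wq):
    # Empirical weighted UKL objective of eq. (5).
    # Hp, Hq: rows are h(x_i) for the reference/target datasets; wp, wq: weights.
    term1 = -np.mean((Hp @ theta) * wp)
    a = Hq @ theta + np.log(wq)          # log of exp(theta^T h) * w, elementwise
    m = a.max()                          # stabilize the log-mean-exp
    term2 = np.mean(wp) * (m + np.log(np.mean(np.exp(a - m))))
    return term1 + term2

def fit_weighted_dre(Hp, Hq, wp, wq):
    # The objective is convex in theta, so a generic quasi-Newton solver suffices.
    theta0 = np.zeros(Hp.shape[1])
    res = minimize(weighted_ukl, theta0, args=(Hp, Hq, wp, wq), method="L-BFGS-B")
    return res.x
```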
#### 3.1.2. Doubly Strong Robustness

Let us consider the robustness of the estimator given by minimizing the UKL divergence (4). Suppose that the reference and target datasets are contaminated by outliers, more precisely, drawn from the contaminated distributions (Huber, 2004; Maronna et al., 2006; Hampel et al., 2011) given by

$$p(x) = (1 - \varepsilon_p)\, p^*(x) + \varepsilon_p\, \delta_p(x), \qquad q(x) = (1 - \varepsilon_q)\, q^*(x) + \varepsilon_q\, \delta_q(x),$$

respectively, where $p^*(x)$ and $q^*(x)$ are the true density functions, $\delta_p(x)$ and $\delta_q(x)$ are the density functions of outliers, and $\varepsilon_p$ and $\varepsilon_q$ ($0 \le \varepsilon_p < 1$, $0 \le \varepsilon_q < 1$) are the contamination ratios of outliers. The contaminated density ratio is $r(x) = p(x)/q(x)$, and the true density ratio is $r^*(x) = p^*(x)/q^*(x)$.

The target parameter of the minimization problem (4) can have a latent bias in the contaminated setting. Because we only have the contaminated density functions $p(x)$ and $q(x)$, the target parameter in the contaminated setting is $\theta^\dagger_{\mathrm{UKL}} = \operatorname{argmin}_\theta D_{\mathrm{UKL}}(r, r_\theta; w)$. This is different from the true target parameter $\theta^*_{\mathrm{UKL}} = \operatorname{argmin}_\theta D_{\mathrm{UKL}}(r^*, r_\theta; w)$. Robust statistics aims to make the latent bias $\theta^\dagger_{\mathrm{UKL}} - \theta^*_{\mathrm{UKL}}$ sufficiently small (Huber, 2004; Hampel et al., 2011). To achieve a small bias, we assume that the weight function can ignore the outlier distributions.

**Assumption 3.1.** Suppose that $\theta$ exists in a compact convex set $\Theta$. Let $f_q(x, \theta) = \exp(\theta^T h(x))\, w(x)\, \delta_q(x)$ and

$$\nu_1 = \int w(x)\, \delta_p(x)\,dx, \quad \nu_2 = \int \theta^T h(x)\, w(x)\, \delta_p(x)\,dx, \quad \nu_3 = \int f_q(x, \theta)\,dx,$$
$$\nu_4 = \int h(x)\, f_q(x, \theta)\,dx, \quad \nu_5 = \int h(x)\, h(x)^T f_q(x, \theta)\,dx.$$

Let $A_j$ be the index set of the elements of $\nu_j$, and $\nu = \max\{\sup_{\theta \in \Theta} |\nu_{ja}|\}_{j=1,\dots,5;\, a \in A_j}$. Then, $\nu$ is sufficiently small.

Let us consider an example where Assumption 3.1 is satisfied.

**Example 3.2.** Let $\delta_a(x)$ be a Dirac delta function at $a$. Consider the commonly used outlier distributions $\delta_p(x) = \delta_{x^{(p)}_o}(x)$ and $\delta_q(x) = \delta_{x^{(q)}_o}(x)$ with outliers $x^{(p)}_o$ and $x^{(q)}_o$. Let $h(x) = (x_1 x_1, x_1 x_2, \dots, x_d x_d)^T$, which is commonly used for Gaussian distributions and Ising models (Liu et al., 2017b). Let $w(x) = \exp(-\|x\|_4^4)$, where $\|x\|_4^4 = x_1^4 + x_2^4 + \dots + x_d^4$. If $x^{(p)}_o$ and $x^{(q)}_o$ are strong outliers, more precisely, if $\|x^{(p)}_o\|$ and $\|x^{(q)}_o\|$ are sufficiently large, this example satisfies Assumption 3.1 because

$$\nu_1 = w(x^{(p)}_o), \quad \nu_2 = \theta^T h(x^{(p)}_o)\, w(x^{(p)}_o), \quad \nu_3 = \exp(\theta^T h(x^{(q)}_o))\, w(x^{(q)}_o), \quad \nu_4 = h(x^{(q)}_o)\, \nu_3, \quad \nu_5 = h(x^{(q)}_o)\, h(x^{(q)}_o)^T \nu_3.$$

Importantly, the weight function $w(x)$ should converge to zero more rapidly than the feature transform function $h(x)$ and the parametric density ratio function $r_\theta(x) = \exp(\theta^T h(x))$ diverge. A practical choice of the weight function is discussed in Appendix C.

Under Assumption 3.1, the following theorem holds.

**Theorem 3.3.** Under Assumption 3.1, we have

$$D_{\mathrm{UKL}}(r, r_\theta; w) = (1 - \varepsilon_p)\left\{ D_{\mathrm{UKL}}(r^*, r_\theta; w) + \text{const} + O(\varepsilon_r \nu) \right\}, \qquad \varepsilon_r = \max\left\{ \frac{\varepsilon_p}{1 - \varepsilon_p},\ \frac{-\varepsilon_p \log(1 - \varepsilon_q)}{1 - \varepsilon_p},\ \frac{\varepsilon_q}{1 - \varepsilon_q} \right\}.$$

Furthermore, assume that $\theta^\dagger_{\mathrm{UKL}}$ and $\theta^*_{\mathrm{UKL}}$ are unique interior points in $\Theta$. Then, we have $\theta^\dagger_{\mathrm{UKL}} - \theta^*_{\mathrm{UKL}} = O(\varepsilon_r \nu)$.

The outline of the proof follows Fujisawa & Eguchi (2008). The proof is given in Appendix D.1. This theorem implies that the latent bias $\theta^\dagger_{\mathrm{UKL}} - \theta^*_{\mathrm{UKL}}$ can be sufficiently small even when the contamination ratios $\varepsilon_p$ and $\varepsilon_q$ are not small. This property can be called strong robustness, which was considered for one contaminated distribution in past papers (Fujisawa & Eguchi, 2008; Hirose et al., 2017; Hung et al., 2018; Kawashima & Fujisawa, 2023). The above theorem shows doubly strong robustness because there are two contaminated distributions.

### 3.2. γ-DRE

Although Weighted DRE seems reasonable, it has a slight drawback in estimating the normalizing term $C$ in (3). This normalizing term is computed from both the reference and target datasets and may not be precise due to sampling randomness. The normalizing term $C$ is a nuisance parameter, and our interest is only in estimating the difference parameter $\theta$. In this section, we propose another DRE method that avoids estimating the normalizing term $C$.

#### 3.2.1. Formulation

The γ-divergence (Fujisawa & Eguchi, 2008) is a popular tool for estimating a density function robustly under heavily contaminated conditions. Let $f(x)$ and $g(x)$ be strictly positive functions, not necessarily density functions, and let $\gamma$ be a positive constant. The γ-divergence $D_\gamma(g, f)$ is written as $D_\gamma(g, f) = d_\gamma(g, f) - d_\gamma(g, g)$, where $d_\gamma(g, f)$ is the γ-cross entropy:

$$d_\gamma(g, f) = -\frac{1}{\gamma} \log \int g(x)\, f(x)^\gamma\,dx + \frac{1}{1 + \gamma} \log \int f(x)^{1+\gamma}\,dx.$$

We propose γ-DRE, which minimizes the γ-cross entropy (or the γ-divergence) between the true density ratio $r(x)$ and the parametric density ratio function $r_{\theta,C}(x) = C\, r_\theta(x) = C \exp(\theta^T h(x))$ with the base measure $w(x)\, q(x)\,dx$:

$$d_\gamma(r, r_{\theta,C}; wq) = -\frac{1}{\gamma} \log \int r_{\theta,C}(x)^\gamma\, w(x)\, p(x)\,dx + \frac{1}{1 + \gamma} \log \int r_{\theta,C}(x)^{1+\gamma}\, w(x)\, q(x)\,dx$$
$$= -\frac{1}{\gamma} \log \int \exp(\gamma\, \theta^T h(x))\, w(x)\, p(x)\,dx + \frac{1}{1 + \gamma} \log \int \exp((1 + \gamma)\, \theta^T h(x))\, w(x)\, q(x)\,dx = d_\gamma(r, r_\theta; wq). \tag{6}$$

The γ-divergence with the base measure $w(x)\, q(x)\,dx$, that is, $D_\gamma(r, r_\theta; wq) = d_\gamma(r, r_\theta; wq) - d_\gamma(r, r; wq)$, satisfies the following: (i) $D_\gamma(r, r_\theta; wq) \ge D_\gamma(r, r; wq) = 0$; (ii) $D_\gamma(r, r_\theta; wq) = 0 \Leftrightarrow r_\theta(x) = \alpha\, r(x)$, where $\alpha > 0$ is a constant. Property (ii) is slightly different from the conventional one, and it implies that the normalizing term $C$ vanishes in (6). The above objective function is empirically approximated by the two datasets $\{x^{(p)}_i\}_{i=1}^{n_p}$ and $\{x^{(q)}_i\}_{i=1}^{n_q}$:

$$\hat{d}_\gamma(r, r_\theta; wq) = -\frac{1}{\gamma} \log \frac{1}{n_p} \sum_{i=1}^{n_p} \exp(\gamma\, \theta^T h(x^{(p)}_i))\, w(x^{(p)}_i) + \frac{1}{1 + \gamma} \log \frac{1}{n_q} \sum_{i=1}^{n_q} \exp((1 + \gamma)\, \theta^T h(x^{(q)}_i))\, w(x^{(q)}_i) = -g_1(\theta) + g_2(\theta). \tag{7}$$

The minimization of (7) with respect to $\theta$ is not a convex problem but a DC (Difference of Convex functions) problem because $g_1(\theta)$ and $g_2(\theta)$ are convex. This problem can be iteratively solved via the Fenchel-Rockafellar dual (Dinh & Thi, 1997):

$$\theta^{(k)} = \operatorname{argmin}_\theta\ g_2(\theta) - \theta^T \nabla g_1(\theta^{(k-1)}), \tag{8}$$

where $k$ is the iteration number. At each iteration, the objective function in (8) is convex and can be minimized by optimization methods, including gradient descent; a minimal sketch follows below. A more efficient algorithm for the high-dimensional setting is discussed in Appendix B.2.
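The following is a minimal sketch of the DC iteration (8), reusing the feature-matrix and weight conventions of the Weighted DRE sketch above; `grad_g1` is the gradient of $g_1$, a softmax-weighted mean of the reference features.

```python
import numpy as np
from scipy.optimize import minimize

def grad_g1(theta, Hp, wp, gamma):
    # gradient of g1(theta) = (1/gamma) log (1/n_p) sum_i exp(gamma theta^T h_i) w_i
    a = gamma * (Hp @ theta) + np.log(wp)
    a -= a.max()                         # numerical stabilization
    s = np.exp(a)
    s /= s.sum()                         # softmax weights over reference points
    return Hp.T @ s

def g2(theta, Hq, wq, gamma):
    # g2(theta) = (1/(1+gamma)) log (1/n_q) sum_i exp((1+gamma) theta^T h_i) w_i
    a = (1.0 + gamma) * (Hq @ theta) + np.log(wq)
    m = a.max()
    return (m + np.log(np.mean(np.exp(a - m)))) / (1.0 + gamma)

def fit_gamma_dre(Hp, Hq, wp, wq, gamma=0.01, n_iter=20):
    theta = np.zeros(Hp.shape[1])
    for _ in range(n_iter):              # DC iteration of eq. (8)
        xi = grad_g1(theta, Hp, wp, gamma)
        theta = minimize(lambda t: g2(t, Hq, wq, gamma) - t @ xi,
                         theta, method="L-BFGS-B").x
    return theta
```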
#### 3.2.2. Doubly Strong Robustness

γ-DRE also has doubly strong robustness. We assume a property of the weight function similar to Assumption 3.1.

**Assumption 3.4.** Suppose that $\theta$ exists in a compact convex set $\Theta$. Let $f_{gm}(x, \theta) = \exp((m + \gamma)\, \theta^T h(x))\, w(x)\, \delta_g(x)$ for $g \in \{p, q\}$ and $m \in \{0, 1\}$, and

$$\nu'_1 = \int f_{p0}(x, \theta)\,dx, \quad \nu'_2 = \int f_{q1}(x, \theta)\,dx, \quad \nu'_3 = \int h(x)\, f_{p0}(x, \theta)\,dx,$$
$$\nu'_4 = \int h(x)\, f_{q1}(x, \theta)\,dx, \quad \nu'_5 = \int h(x)\, h(x)^T f_{p0}(x, \theta)\,dx, \quad \nu'_6 = \int h(x)\, h(x)^T f_{q1}(x, \theta)\,dx.$$

Let $A'_j$ be the index set of the elements of $\nu'_j$, and $\nu' = \max\{\sup_{\theta \in \Theta} |\nu'_{ja}|\}_{j=1,\dots,6;\, a \in A'_j}$. Then, $\nu'$ is sufficiently small.

The same example as Example 3.2 satisfies Assumption 3.4 if $\gamma$ is set to a moderate value, for instance, $\gamma \in (0, 1]$ (Fujisawa & Eguchi, 2008). Under Assumption 3.4, the following theorem holds.

**Theorem 3.5.** Under Assumption 3.4, we have

$$d_\gamma(r, r_\theta; wq) = d_\gamma(r^*, r_\theta; wq^*) + \text{const} + O(\varepsilon'_r \nu'), \qquad \varepsilon'_r = \max\left\{ \frac{\varepsilon_p}{1 - \varepsilon_p},\ \frac{\varepsilon_q}{1 - \varepsilon_q} \right\},$$

where $\theta^\dagger_\gamma = \operatorname{argmin}_\theta d_\gamma(r, r_\theta; wq)$ and $\theta^*_\gamma = \operatorname{argmin}_\theta d_\gamma(r^*, r_\theta; wq^*)$. Furthermore, assume that $\theta^\dagger_\gamma$ and $\theta^*_\gamma$ are unique interior points in $\Theta$ and that the Hessian matrix of $d_\gamma(r^*, r_\theta; wq^*)$ at $\theta = \theta^*_\gamma$ is positive definite. Then, we have $\theta^\dagger_\gamma - \theta^*_\gamma = O(\varepsilon'_r \nu')$.

The proof is given in Appendix D.2. γ-DRE also has doubly strong robustness because Theorem 3.5 holds even when the contamination ratios $\varepsilon_p$ and $\varepsilon_q$ are not small.

### 3.3. Related Works

The most promising robust DRE method is Trimmed DRE (Liu et al., 2017c). It trims outliers by assuming that they have larger values of the density ratio function than inliers. The estimator of Trimmed DRE is given by optimizing a weighted KL divergence, which reduces to a min-max problem of a convex function:

$$\max_\theta \min_{w \in [0,1]^{n_p},\ \langle \mathbf{1}, w \rangle = \nu n_p} \sum_{i=1}^{n_p} w_i \log r_{\theta, C^*}(x^{(p)}_i),$$

where $w \in \mathbb{R}^{n_p}$ is a weight parameter for the reference dataset, $\nu \in (0, 1]$ is a trimming quantile, and $C^* = n_q / \sum_{i=1}^{n_q} \exp(\theta^T h(x^{(q)}_i))$. In estimating $w$ given $\theta$, the $w_i$ with the top $100(1-\nu)\%$ of $\log r_{\theta,C^*}(x^{(p)}_i)$ values become $0$, and the rest become $1/n_p$; a sketch of this trimming step is given at the end of this section. In estimating $\theta$ given $w$, data points with $w_i = 0$ are removed from the maximization of the empirical density ratio function. Thus, the parameter $\theta$ is estimated with only inliers.

Although Trimmed DRE is shown to be robust, the assumption it relies on does not hold in some situations. The parametric density ratio function $r_\theta(x) \propto \exp(\theta^T h(x))$ does not have large values for some outliers. For simplicity, consider the case where $x, \theta, \beta_p, \beta_q \in \mathbb{R}$, $\theta^T h(x) = \theta x$, $\theta = \beta_p - \beta_q$, and the outlier is $x_o = c$ ($c > 0$). We have the following three cases as $c \to \infty$: (i) when $\beta_p > \beta_q$, $r_\theta(x_o) \propto \exp(|\theta| c) \to \infty$; (ii) when $\beta_p < \beta_q$, $r_\theta(x_o) \propto \exp(-|\theta| c) \to 0$; (iii) when $\beta_p = \beta_q$, $r_\theta(x_o) \propto \exp(0) = \text{const}$. These cases indicate that Trimmed DRE is not robust when $r_\theta(x_o)$ does not diverge as $c \to \infty$, namely in cases (ii) and (iii). A similar discussion applies when $\theta^T h(x) = \sum_{u,v=1}^d \theta_{u,v}\, x_u x_v$, a typical setting for multivariate Gaussian distributions and Ising models (Liu et al., 2017b).

Another limitation is that Trimmed DRE is robust only when outliers contaminate the reference dataset while the target dataset stays clean. The normalizing term $C^*$ is calculated from the target dataset without trimming. If the target dataset is contaminated, the calculation of the normalizing term will be unstable.

Our proposals, Weighted DRE and γ-DRE, overcome these shortcomings. They assume that outlier distributions are negligible in the sense of Assumption 3.1 or Assumption 3.4 and are robust to contamination of both the reference and target datasets. They also need no hyperparameter tuning that depends on the true contamination ratio, whereas Trimmed DRE needs to tune the trimming quantile $\nu$.
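For comparison, the following is a minimal sketch of the w-step of Trimmed DRE as described above; the function name is hypothetical.

```python
import numpy as np

def trim_weights(log_ratio, nu):
    # Zero out the top 100*(1 - nu)% of log density-ratio values (treated as
    # outliers); the remaining points receive weight 1/n_p.
    n_p = len(log_ratio)
    n_keep = int(np.floor(nu * n_p))
    w = np.zeros(n_p)
    w[np.argsort(log_ratio)[:n_keep]] = 1.0 / n_p  # keep the smallest values
    return w
```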
## 4. Numerical Experiments

### 4.1. Difference between Precision Matrices

We estimated the difference between the precision matrices of two normal distributions. Let $x = (x_1, x_2)^T \in \mathbb{R}^2$, and let the true density functions of the reference and target distributions be $p^*(x) = f_{\lambda_p}(x)$ and $q^*(x) = f_{\lambda_q}(x)$, respectively, where $f_\lambda(x) = N(x \mid 0, \Lambda_\lambda^{-1})$ with precision matrix $\Lambda_\lambda = \begin{pmatrix} 1 & \lambda \\ \lambda & 1 \end{pmatrix}$. This setting is similar to previous research (Liu et al., 2014; 2017c). The parameters $\lambda_p$ and $\lambda_q$ ranged from $-0.9$ to $0.9$. The outlier distribution $\delta(x)$ was set to place outliers around $(100, 100)^T$ (see the discussion of Figure 1 below). We prepared four settings of the reference distribution $p(x)$ and the target distribution $q(x)$:

- "clean": $p(x) = p^*(x)$, $q(x) = q^*(x)$.
- "reference contamination": $p(x) = 0.8\, p^*(x) + 0.2\, \delta(x)$, $q(x) = q^*(x)$.
- "target contamination": $p(x) = p^*(x)$, $q(x) = 0.8\, q^*(x) + 0.2\, \delta(x)$.
- "double contamination": $p(x) = 0.8\, p^*(x) + 0.2\, \delta(x)$, $q(x) = 0.8\, q^*(x) + 0.2\, \delta(x)$.

The dataset sizes were set to $n_p = n_q = 100$. We compared four methods (DRE, Trimmed DRE, Weighted DRE, and γ-DRE). The density ratio function was parametrized by $\theta^T h(x) = \sum_{u,v=1}^2 \theta_{u,v}\, x_u x_v$. The ground truth is given by $\theta_{1,2} = \theta_{2,1} = \lambda_q - \lambda_p$ and $\theta_{1,1} = \theta_{2,2} = 0$ because the true density ratio is $r^*(x) = p^*(x)/q^*(x) \propto \exp((\lambda_q - \lambda_p)\, x_1 x_2)$. The weight function was set to $w(x) = \exp(-\|x\|_4^4 / 50)$. The trimming quantile $\nu$ in Trimmed DRE was set to the true contamination ratio. The parameter $\gamma$ in γ-DRE was set to $0.01$. No regularization term was added to the objective function; a sketch of the data generation is given below.
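A minimal sketch of the data generation follows. The precision-matrix parametrization matches the setup above, but the exact outlier distribution is an assumption on our part, chosen only to place outliers near $(100, 100)^T$ as the discussion of Figure 1 indicates.

```python
import numpy as np

def sample_main(lam, n, rng):
    # draw from N(0, Lambda^{-1}) with precision Lambda = [[1, lam], [lam, 1]]
    cov = np.linalg.inv(np.array([[1.0, lam], [lam, 1.0]]))
    return rng.multivariate_normal(np.zeros(2), cov, size=n)

def sample_contaminated(lam, n, eps, rng, outlier_loc=(100.0, 100.0)):
    # mixture (1 - eps) * N(0, Lambda^{-1}) + eps * delta; the outlier location
    # is assumed, consistent with x_1, x_2 around 100 in the text
    n_out = rng.binomial(n, eps)
    clean = sample_main(lam, n - n_out, rng)
    outliers = rng.normal(loc=outlier_loc, scale=1.0, size=(n_out, 2))
    return np.vstack([clean, outliers])

rng = np.random.default_rng(0)
Xp = sample_contaminated(lam=-0.4, n=100, eps=0.2, rng=rng)  # reference dataset
Xq = sample_contaminated(lam=0.4, n=100, eps=0.2, rng=rng)   # target dataset
```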
*Figure 1. Estimation of the difference parameter $\theta_{1,2}$ between the off-diagonal elements of the precision matrices of normal distributions. Each column shows a contamination setting: "clean", "reference contamination", "target contamination", and "double contamination". Each row shows the ground truth and the estimation methods: DRE, Trimmed DRE, Weighted DRE, and γ-DRE. The x-axis and y-axis of each panel indicate the true values of the off-diagonal elements of the reference and target distributions, respectively.*

**Table 1.** MSE of the estimated parameters in the "clean" setting (standard deviations in parentheses).

| $\theta_{1,2}$ | $\lambda_p$ | $\lambda_q$ | DRE | Trimmed DRE | Weighted DRE | γ-DRE |
|---|---|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 0.0123 (0.01645) | 0.0196 (0.0393) | 0.0240 (0.0491) | 0.0097 (0.0185) |
| 0.4 | 0.0 | 0.4 | 0.0565 (0.06743) | 0.0551 (0.0548) | 0.0883 (0.0765) | 0.0493 (0.0400) |
| 0.8 | -0.4 | 0.4 | 0.1535 (0.10345) | 0.1428 (0.0966) | 0.2349 (0.1351) | 0.1387 (0.0850) |
| 1.2 | -0.8 | 0.4 | 0.3035 (0.61568) | 0.2363 (0.2536) | 0.5086 (0.2216) | 0.2631 (0.1855) |
| 1.6 | -0.8 | 0.8 | 1.5608 (2.75867) | 1.4278 (2.6771) | 0.7467 (0.3520) | 0.2931 (0.2971) |

**Table 2.** MSE of the estimated parameters in the "double contamination" setting (standard deviations in parentheses).

| $\theta_{1,2}$ | $\lambda_p$ | $\lambda_q$ | DRE | Trimmed DRE | Weighted DRE | γ-DRE |
|---|---|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 5.3757 (5.5380) | 5.5567 (4.8591) | 0.0299 (0.0455) | 0.0131 (0.0165) |
| 0.4 | 0.0 | 0.4 | 5.5245 (5.5945) | 6.0835 (6.4211) | 0.0840 (0.0845) | 0.0689 (0.0704) |
| 0.8 | -0.4 | 0.4 | 8.3192 (7.8744) | 9.0380 (7.0335) | 0.2950 (0.1739) | 0.1629 (0.0955) |
| 1.2 | -0.8 | 0.4 | 8.2561 (7.7996) | 10.2449 (8.2807) | 0.5750 (0.2867) | 0.2175 (0.1636) |
| 1.6 | -0.8 | 0.8 | 8.4907 (8.8919) | 9.3233 (8.7860) | 0.9466 (0.5066) | 0.3042 (0.2891) |

Figure 1 shows that Weighted DRE and γ-DRE successfully estimated the ground truth of $\theta_{1,2}$ in all the contaminated settings. This was because the experimental setting satisfies Assumption 3.1 and Assumption 3.4. DRE failed to estimate the ground truth in the contaminated settings, except for a part of the "target contamination" setting. In the "target contamination" setting, DRE estimated the ground truth robustly where $\theta_{1,2} < 0$, in the lower right of the figure. This was because the influence of the outliers vanished from the second term of the objective function (5), where $\exp(\theta^T h(x^{(q)}_i)) = \exp(\sum_{u,v=1}^2 \theta_{u,v}\, x^{(q)}_{i,u} x^{(q)}_{i,v}) \approx 0$ with $\theta_{1,2} = \theta_{2,1} < 0$, $\theta_{1,1} = \theta_{2,2} = 0$, and $x^{(q)}_{i,1}, x^{(q)}_{i,2} \approx 100$. Elsewhere, in the "reference contamination" and "target contamination" settings, the estimated values were heavily biased in opposite directions because the reference and target datasets enter the objective function (5) with opposite signs. In the "double contamination" setting, the color tendency was unclear because, for example, the estimated values tended to be positive/negative when the generated outliers were stronger in the reference/target dataset than in the other.

Trimmed DRE successfully estimated the ground truth in the "reference contamination" setting where $\theta_{1,2} > 0$, in the upper left of the figure. In this region, the density ratio values of the outliers became large through $\exp(\sum_{u,v=1}^2 \theta_{u,v}\, x_u x_v)$, where $\theta_{1,2} = \theta_{2,1} > 0$, $\theta_{1,1} = \theta_{2,2} = 0$, and $x_1, x_2 \approx 100$. Because the estimator of Trimmed DRE was designed to be robust when outliers have larger density ratio values than inliers, the estimator was robust in that region. In other regions, the estimation performance was similar to that of DRE. These results agree with the theoretical properties discussed in Section 3.3.

We calculated the mean squared error (MSE) values of the estimated parameters for several pairs of the true parameters $\lambda_p$ and $\lambda_q$. The experiments were repeated 100 times. More details are given in Appendix E, including extended cases of the parameter pairs and the contamination settings. Tables 1 and 2 show that Weighted DRE and γ-DRE achieved robustness without compromising the precision of the estimation. In the range $0.0 \le \theta_{1,2} \le 1.2$ in Table 1 (the "clean" setting), the MSE values of Weighted DRE and γ-DRE were comparable to those of DRE and Trimmed DRE. Besides, at $\theta_{1,2} = 1.6$, Weighted DRE and γ-DRE had significantly smaller MSE values than DRE and Trimmed DRE. That might be because the weight function excluded weak outliers sampled from the main body of the density functions. In Table 2 (the "double contamination" setting), the MSE values of Weighted DRE and γ-DRE were significantly smaller than those of DRE and Trimmed DRE, which shows the robustness of our proposals. Tables 1 and 2 also show that γ-DRE estimated the ground truth slightly better than Weighted DRE: its MSE values were smaller than those of Weighted DRE in all the settings. γ-DRE does not estimate the normalizing term (3), so its estimates may carry smaller errors.
### 4.2. Change Detection in Time Series Data

As an experiment with real-world data, we performed change detection in time series data. We used a human activity dataset provided by the Human Activity Sensing Consortium (HASC) Challenge 2011, which was used in a previous experiment on change detection with DRE (Liu et al., 2013). The task is to segment time series data measured by acceleration sensors according to six categories: "stay", "walk", "jog", "skip", "stair up", and "stair down". The measured data has three dimensions (x-, y-, and z-axis).

For the change detection task, the original sequence was divided and assigned to the reference and target datasets sequentially; a sketch of this sliding-window procedure is given at the end of this subsection. The sizes of the reference and target datasets were set to $n_p = n_q = 100$. We compared four methods (DRE, Trimmed DRE, Weighted DRE, and γ-DRE). The parametric function was designed as $\theta^T h(x) = \sum_{u,v=1}^3 \theta_{u,v}\, x_u x_v$. The weight function was set to $w(x) = \exp(-\|x\|_4^4 / 5)$ in Weighted DRE and γ-DRE. We added the elastic net regularization term $\lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2$ with $\lambda_1 = \lambda_2 = 0.5$ to the objective function. The trimming quantile $\nu$ was set to $0.9$ in Trimmed DRE. The tuning parameter $\gamma$ was set to $0.01$ in γ-DRE. All the estimated parameters, $\theta_{u,v}$ for $1 \le u \le v \le 3$, were used as anomaly levels instead of the density ratio values because $\theta_{u,v} = 0$ at no-change points.

*Figure 2. Change detection on the human activity sensor data. The top panel shows the sequence of the three-dimensional data (x-, y-, and z-axis). Each panel below shows all the estimated parameters of DRE, Trimmed DRE, Weighted DRE, and γ-DRE. The vertical dashed lines are the ground-truth change points with the category labels of the human activities. Bold dashed lines indicate the changes from and to "skip" or "jog".*

Figure 2 shows that Weighted DRE and γ-DRE successfully detected the change points without being disturbed by outliers. The top row of Figure 2 indicates that the changes from and to "skip" or "jog" are large and the other changes are small. All the methods successfully detected these large change points. DRE falsely detected the outliers around times [10, 20] and [100, 110], and Trimmed DRE did so around times [10, 12] and [100, 110]. Besides, they estimated larger anomaly levels at the outliers than at the true change points. Weighted DRE and γ-DRE did not detect these outliers. Around time 80, Weighted DRE and γ-DRE could not detect the change point, although DRE and Trimmed DRE could. This was the change point between "stair down" and "walk", but there was no apparent change in the original data; DRE and Trimmed DRE seem to have detected an outlier rather than the change point.
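A minimal sketch of the sliding-window procedure follows; `fit` stands for any of the DRE estimators above (for example, Weighted DRE composed with the quadratic feature map), and the helper names are our own.

```python
import numpy as np

def change_scores(X, fit, window=100, step=10):
    # Slide adjacent reference/target windows over a sequence X of shape (T, d)
    # and use the magnitude of the estimated difference parameter as the
    # anomaly level, since theta = 0 at no-change points.
    scores = []
    for t in range(0, len(X) - 2 * window + 1, step):
        Xp = X[t:t + window]               # reference window
        Xq = X[t + window:t + 2 * window]  # target window
        theta = fit(Xp, Xq)
        scores.append(np.max(np.abs(theta)))
    return np.array(scores)
```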
## 5. Conclusions

We have presented two density ratio estimation methods, Weighted DRE and γ-DRE. Their estimators are robust even when many outliers contaminate both the reference and target datasets, a property we call doubly strong robustness. Theoretical analysis reveals that the weight function should converge to zero more rapidly than the density ratio function diverges. Numerical experiments show that our proposals are more robust than previous methods on synthetic and real-world datasets.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. C. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217, 1967.

Choi, K., Liao, M., and Ermon, S. Featurized density ratio estimation. In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161, pp. 172–182, 2021.

Choi, K., Meng, C., Song, Y., and Ermon, S. Density ratio estimation via infinitesimal classification. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.

Dinh, T. P. and Thi, H. A. L. Convex analysis approach to d.c. programming: theory, algorithm and applications. 1997.

Fujisawa, H. and Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9):2053–2081, 2008.

Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, pp. 131–160, 2009.

Hadi, A. S. and Luceno, A. Maximum trimmed likelihood estimators: a unified approach, examples, and algorithms. Computational Statistics & Data Analysis, 25(3):251–272, 1997.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 2011.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., and Kanamori, T. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 26:309–336, 2011.

Hirose, K., Fujisawa, H., and Sese, J. Robust sparse Gaussian graphical modeling. Journal of Multivariate Analysis, 161:172–190, 2017.

Huber, P. J. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

Huber, P. J. Robust Statistics. John Wiley & Sons, 2004.

Hung, H., Jou, Z.-Y., and Huang, S.-Y. Robust mislabel logistic regression without modeling mislabel probabilities. Biometrics, 74(1), 2018.

Ideker, T. and Krogan, N. J. Differential network biology. Molecular Systems Biology, 8:565, 2012.

Kanamori, T., Hido, S., and Sugiyama, M. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10(48):1391–1445, 2009.

Kanamori, T., Suzuki, T., and Sugiyama, M. Theoretical analysis of density ratio estimation. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E93.A(4):787–798, 2010.

Kanamori, T., Suzuki, T., and Sugiyama, M. Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86:335–367, 2012.

Kato, M. and Teshima, T. Non-negative Bregman divergence minimization for deep direct density ratio estimation. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 5320–5333. PMLR, 2021.

Kawahara, Y. and Sugiyama, M. Change-point detection in time-series data by direct density-ratio estimation. pp. 389–400, 2009.

Kawashima, T. and Fujisawa, H. Robust regression against heavy heterogeneous contamination. Metrika, 86:421–442, 2023.

Kim, B., Liu, S., and Kolar, M. Two-sample inference for high-dimensional Markov networks. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(5):939–962, 2021.

Liu, S., Yamada, M., Collier, N., and Sugiyama, M. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43:72–83, 2013.

Liu, S., Quinn, J., Gutmann, M., Suzuki, T., and Sugiyama, M. Direct learning of sparse changes in Markov networks by density ratio estimation. Neural Computation, 26:1169–1197, 2014.
Liu, S., Fukumizu, K., and Suzuki, T. Learning sparse structural changes in high-dimensional Markov networks: a review on methodologies and theories. Behaviormetrika, 44:265–296, 2017a.

Liu, S., Suzuki, T., Relator, R., Sese, J., Sugiyama, M., and Fukumizu, K. Support consistency of direct sparse-change learning in Markov networks. The Annals of Statistics, 45(3):959–990, 2017b.

Liu, S., Takeda, A., Suzuki, T., and Fukumizu, K. Trimmed density ratio estimation. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017c.

Maronna, R. A., Martin, R. D., and Yohai, V. J. Robust Statistics: Theory and Methods. John Wiley & Sons, 2006.

Na, S., Kolar, M., and Koyejo, O. Estimating differential latent variable graphical models with applications to brain connectivity. Biometrika, 108(2):425–442, 2020.

Nguyen, X., Wainwright, M. J., and Jordan, M. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.

Rhodes, B., Xu, K., and Gutmann, M. U. Telescoping density-ratio estimation. In Advances in Neural Information Processing Systems, volume 33, pp. 4905–4916. Curran Associates, Inc., 2020.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Smola, A., Song, L., and Teo, C. H. Relative novelty detection. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pp. 536–543. PMLR, 2009.

Srivastava, A., Han, S., Xu, K., Rhodes, B., and Gutmann, M. Estimating the density ratio between distributions with high discrepancy using multinomial logistic regression. Transactions on Machine Learning Research, 2023(3):1–23, 2023.

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60:699–746, 2008.

Sugiyama, M., Suzuki, T., and Kanamori, T. Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64:1009–1044, 2012a.

Sugiyama, M., Suzuki, T., and Kanamori, T. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012b.

Tsuboi, Y., Kashima, H., Hido, S., Bickel, S., and Sugiyama, M. Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138–155, 2009.

Wornowizki, M. and Fried, R. Two-sample homogeneity tests based on divergence measures. Computational Statistics, 31:291–313, 2016.

Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., and Sugiyama, M. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25:1324–1370, 2011.

Yang, E. and Lozano, A. C. Robust Gaussian graphical modeling with the trimmed graphical lasso. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
Zhang, Y.-J., Zhang, Z.-Y., Zhao, P., and Sugiyama, M. Adapting to continuous covariate shift via online density ratio estimation. In Advances in Neural Information Processing Systems, volume 36, pp. 29074–29113. Curran Associates, Inc., 2023.

Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2):301–320, 2005.

## A. Discussion of the Parametric Density Ratio Function

We justify the formulation of the parametric density ratio function (1) in the case of an exponential family. Suppose the two density functions belong to an exponential family, more precisely, $p(x) = f_{\beta_p}(x)$ and $q(x) = f_{\beta_q}(x)$, where

$$f_\beta(x) = \exp\left( \beta^T h(x) + b(x) - \log A(\beta) \right),$$

with $\beta \in \mathbb{R}^p$, $h(x) : \mathbb{R}^d \to \mathbb{R}^p$, $b(x) : \mathbb{R}^d \to \mathbb{R}$, and $A(\beta) : \mathbb{R}^p \to \mathbb{R}$. Then the true density ratio can be written as

$$r(x) = \frac{A(\beta_q)}{A(\beta_p)} \exp\left\{ (\beta_p - \beta_q)^T h(x) \right\}. \tag{9}$$

In the parametric formulation, we estimate the difference $\beta_p - \beta_q$ directly, without estimating the individual parameters $\beta_p$ and $\beta_q$ (Liu et al., 2014; 2017a). To approximate the true density ratio $r(x)$, the parametric density ratio function $r_{\theta,C}(x) : \mathbb{R}^d \to \mathbb{R}$ is introduced as

$$r_{\theta,C}(x) = C \exp\left( \theta^T h(x) \right). \tag{10}$$

By comparing (9) and (10), the optimal parameters $\theta^*$ and $C^*$ are given by

$$\theta^* = \beta_p - \beta_q, \qquad C^* = \frac{A(\beta_q)}{A(\beta_p)}.$$

We can easily show that $\theta^*$ and $C^*$ are the minimizer of (2). Notably, the parametric density ratio function $r_{\theta,C}(x)$ can be applied to any density ratio, even when the density functions $p(x)$ and $q(x)$ do not belong to an exponential family. In that case, $\theta^*$ and $C^*$ cannot be written explicitly.

## B. Efficient Optimization by the Dual Problem

We show that the primal-dual trick enables efficient calculation of the estimators of Weighted DRE and γ-DRE, and that it is scalable to high-dimensional data (Boyd & Vandenberghe, 2004; Liu et al., 2014). The core of this trick is to avoid calculating the log-expectation term directly.

### B.1. Dual Problem of Weighted DRE

By adding L1 and L2 norms to (5), the objective function of Weighted DRE, $L_{\mathrm{UKL}}(\theta)$, can be written as

$$L_{\mathrm{UKL}}(\theta) = -\theta^T \xi + \kappa \log \frac{1}{n_q} \sum_{i=1}^{n_q} \exp(\theta^T h(x^{(q)}_i))\, w(x^{(q)}_i) + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2, \tag{11}$$

where $\xi = \frac{1}{n_p} \sum_{i=1}^{n_p} h(x^{(p)}_i)\, w(x^{(p)}_i) \in \mathbb{R}^d$, $\kappa = \frac{1}{n_p} \sum_{i=1}^{n_p} w(x^{(p)}_i) \in \mathbb{R}$, and $\lambda_1$ and $\lambda_2$ are regularization parameters for the L1 and L2 norms, respectively. The idea behind converting the objective function (11) into the dual problem is to introduce the variables $z_i = \theta^T h(x^{(q)}_i)$ for $i = 1, \dots, n_q$ in the log-sum-exp term. These equations can be introduced into the original objective function (11) as constraints via Lagrange multipliers. Thus, we can convert the minimization of (11) into the min-max problem $\min_{\theta,z} \max_\alpha L_{\mathrm{UKL}}(\theta, z, \alpha)$, where

$$L_{\mathrm{UKL}}(\theta, z, \alpha) = -\theta^T \xi + \kappa \log \frac{1}{n_q} \sum_{i=1}^{n_q} \exp(z_i)\, w(x^{(q)}_i) + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2 - (z - \Phi^T \theta)^T \alpha, \tag{12}$$

where $\Phi = (h(x^{(q)}_1), \dots, h(x^{(q)}_{n_q})) \in \mathbb{R}^{d \times n_q}$ and $\alpha \in \mathbb{R}^{n_q}$ is a Lagrange multiplier. We can convert this Lagrange function into the dual function by minimizing over $\theta$ and $z$ separately:

$$L_{\mathrm{UKL}}(\alpha) := \min_{\theta, z} L_{\mathrm{UKL}}(\theta, z, \alpha) = \min_\theta \left[ -\theta^T \eta + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2 \right] + \min_z \left[ \kappa \log \frac{1}{n_q} \sum_{i=1}^{n_q} \exp(z_i)\, w(x^{(q)}_i) - z^T \alpha \right]. \tag{13}$$

First, consider the first term:

$$\psi_1(\theta) = -\theta^T \eta + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2, \tag{14}$$

where $\eta = \xi - \Phi \alpha$. Since $\psi_1(\theta)$ is convex, setting its (sub)differential to zero gives the minimizer

$$\hat{\theta}_j = \frac{1}{2\lambda_2}\, s_{\lambda_1}(\eta_j), \quad \text{for } j = 1, \dots, d, \tag{15}$$

where $s_\lambda(x)$ is the soft-threshold function $s_\lambda(x) = \operatorname{sgn}(x)\,(|x| - \lambda)_+$. Substituting the optimal $\hat{\theta}$ into (14),

$$\min_\theta \psi_1(\theta) = -\frac{1}{4\lambda_2} \sum_{j=1}^d (|\eta_j| - \lambda_1)^2_+. \tag{16}$$

Second, consider the second term:

$$\psi_2(z) = \kappa \log \frac{1}{n_q} \sum_{i=1}^{n_q} \exp(z_i)\, w(x^{(q)}_i) - z^T \alpha. \tag{17}$$

Since $\psi_2(z)$ is also convex, setting its differential to zero gives the minimizer $\hat{z}$:

$$\alpha_i = \kappa\, \frac{\exp(\hat{z}_i)\, w(x^{(q)}_i)}{\sum_{i'=1}^{n_q} \exp(\hat{z}_{i'})\, w(x^{(q)}_{i'})}, \quad \text{for } i = 1, \dots, n_q.$$

This equation gives the constraints that $\alpha$ should satisfy: $\alpha_i > 0$ for $i = 1, \dots, n_q$ and $\sum_{i=1}^{n_q} \alpha_i = \kappa$. Substituting the optimal $\hat{z}$ into (17),

$$\min_z \psi_2(z) = -\sum_{i=1}^{n_q} \alpha_i \left( \log \alpha_i - \log w(x^{(q)}_i) \right) + \text{const.} \tag{18}$$

Substituting (16) and (18) into (13), we should solve the problem $\max_\alpha L_{\mathrm{UKL}}(\alpha)$:

$$L_{\mathrm{UKL}}(\alpha) = -\frac{1}{4\lambda_2} \sum_{j=1}^d (|\eta_j| - \lambda_1)^2_+ - \sum_{i=1}^{n_q} \alpha_i \left( \log \alpha_i - \log w(x^{(q)}_i) \right)$$
$$\text{subject to } \alpha_i > 0 \text{ for } i = 1, \dots, n_q, \quad \sum_{i=1}^{n_q} \alpha_i = \kappa.$$

This dual function $L_{\mathrm{UKL}}(\alpha)$ is concave and can be maximized by optimization methods, including gradient descent. After convergence, we can convert the estimated dual parameter $\hat{\alpha}$ to the primal parameter $\hat{\theta}$ by (15). Because this formulation changes the estimation problem from $\hat{\theta} \in \mathbb{R}^p$ to $\hat{\alpha} \in \mathbb{R}^{n_q}$, the primal-dual trick is useful in the high-dimensional setting where $p \gg n_q$.
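A minimal sketch of the primal recovery (15) follows; the soft-threshold function and the mapping from the dual variable $\alpha$ back to $\theta$ are direct transcriptions of the formulas above, with our own function names.

```python
import numpy as np

def soft_threshold(x, lam):
    # s_lambda(x) = sgn(x) * (|x| - lambda)_+, cf. eq. (15)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def primal_from_dual(alpha, xi, Phi, lam1, lam2):
    # Recover theta_j = s_{lambda_1}(eta_j) / (2 lambda_2) with eta = xi - Phi alpha
    eta = xi - Phi @ alpha
    return soft_threshold(eta, lam1) / (2.0 * lam2)
```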
### B.2. Dual Problem of γ-DRE

The optimization problem of γ-DRE is the iterative minimization of (8). Given the previous estimate $\hat{\theta}^{(k-1)}$, the objective function can be written as

$$L_\gamma(\theta) = -\theta^T \xi^{(k-1)} + \frac{1}{1 + \gamma} \log \frac{1}{n_q} \sum_{i=1}^{n_q} \exp\left( (1 + \gamma)\, \theta^T h(x^{(q)}_i) \right) w(x^{(q)}_i) + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2,$$

where

$$\xi^{(k-1)} = \sum_{i=1}^{n_p} \frac{\exp\left( \gamma\, \hat{\theta}^{(k-1)T} h(x^{(p)}_i) \right) w(x^{(p)}_i)}{\sum_{j=1}^{n_p} \exp\left( \gamma\, \hat{\theta}^{(k-1)T} h(x^{(p)}_j) \right) w(x^{(p)}_j)}\, h(x^{(p)}_i).$$

By replacing $\xi$ with $\xi^{(k-1)}$ and $\kappa$ with $1/(1 + \gamma)$ in (11), the same discussion as in Appendix B.1 holds. Notably, this estimation procedure should be iterated until the estimated parameter $\hat{\theta}^{(k)}$ converges.

## C. Choice of the Weight Function

The weight function is expected to satisfy Assumption 3.1 or 3.4. In practice, the weight function can be set as

$$w(x) = \exp\left( -\left\| \frac{x - \mathrm{Med}}{\mathrm{MADN}} \right\|_4^4 \Big/ \tau \right),$$

where $\mathrm{Med}$ is the median and $\mathrm{MADN} = \mathrm{Med}(\{|x_i - \mathrm{Med}(D)|\}_{i=1}^n)/0.675$ is the normalized median absolute deviation of the dataset $D$. This transformation can be interpreted as a robust version of standardization. The hyperparameter $\tau$ should be selected such that usual samples have large weights and outliers have small weights. Because the value of $\|(x - \mathrm{Med})/\mathrm{MADN}\|_4^4$ increases as the dimension increases, the appropriate value of $\tau$ can change. We adopted $\tau = 50$ for the two-dimensional standard Gaussian in Section 4.1. Here, we show how to choose the hyperparameter $\tau$ for the one-dimensional standard Gaussian distribution. Figure 3 shows that $\tau = 10$ is appropriate compared with the probability density function of the standard Gaussian distribution. The weight function with $\tau = 1$ has too narrow a window, which would eliminate usual samples. The weight function with $\tau = 100$ has too wide a window, which cannot eliminate outlier samples. A sketch of this weight computation follows the figure below.

*Figure 3. Comparison between the probability density function of the one-dimensional Gaussian distribution and the weight function with hyperparameters $\tau = 1, 10, 100$.*
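The following is a minimal sketch of this weight computation, assuming Med and MADN are taken per dimension; this per-coordinate convention is our assumption.

```python
import numpy as np

def robust_weight(X, tau=10.0):
    # w(x) = exp(-||(x - Med) / MADN||_4^4 / tau), the robustly standardized
    # weight of Appendix C; Med and MADN are computed per dimension of X (n, d).
    med = np.median(X, axis=0)
    madn = np.median(np.abs(X - med), axis=0) / 0.675
    Z = (X - med) / madn
    return np.exp(-np.sum(Z**4, axis=1) / tau)
```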
## D. Proofs of Doubly Strong Robustness

### D.1. Proof of Theorem 3.3

**Lemma D.1.** Let $f(x)$ and $f_\alpha(x)$ be $C^2$-class real-valued functions on a compact convex set $X$. Let the Hessian matrices of $f(x)$ and $f_\alpha(x)$ be denoted by $J(x)$ and $J_\beta(x)$, respectively. Assume that $J(x)$ is positive definite and $J_\beta(x)$ is continuous in $\beta$. Suppose that $\alpha = O(\nu)$ and $\beta = O(\nu)$ for a sufficiently small value $\nu$. Assume

$$\sup_{x \in X} |f_\alpha(x) - f(x)| = O(\nu), \tag{19}$$
$$\sup_{x \in X} \| J_\beta(x) - J(x) \|_{\max} = O(\nu). \tag{20}$$

Let $x^*$ and $x^*_\alpha$ be the minimizers of $f(x)$ and $f_\alpha(x)$, respectively. Assume that $x^*$ and $x^*_\alpha$ are unique interior points in $X$. Then, we have $x^*_\alpha - x^* = O(\nu)$.

*Proof.* Let the smallest eigenvalues of $J(x)$ and $J_\beta(x)$ over the set $X$ be denoted by $a$ and $a_\beta$, respectively. Let $c = \min\{a, \min_{0 \le \beta \le \nu_0} a_\beta\}$ with a sufficiently small fixed value $\nu_0 > 0$. From the positive definiteness of $J(x)$ and (20), we have $c > 0$. By Taylor expansion,

$$f(x) = f(x^*) + \frac{1}{2}(x - x^*)^T J(\tilde{x})(x - x^*) \ge f(x^*) + \frac{c}{2}\|x - x^*\|^2,$$
$$f_\alpha(x) = f_\alpha(x^*_\alpha) + \frac{1}{2}(x - x^*_\alpha)^T J_\beta(\tilde{x}_\alpha)(x - x^*_\alpha) \ge f_\alpha(x^*_\alpha) + \frac{c}{2}\|x - x^*_\alpha\|^2,$$

where $\tilde{x}$ and $\tilde{x}_\alpha$ are appropriate intermediate points. Using these inequalities, we have

$$f(x^*_\alpha) - f(x^*) \ge \frac{c}{2}\|x^*_\alpha - x^*\|^2, \qquad f_\alpha(x^*) - f_\alpha(x^*_\alpha) \ge \frac{c}{2}\|x^*_\alpha - x^*\|^2,$$

and hence

$$c\, \|x^*_\alpha - x^*\|^2 \le f(x^*_\alpha) - f(x^*) + f_\alpha(x^*) - f_\alpha(x^*_\alpha) \le |f(x^*_\alpha) - f_\alpha(x^*_\alpha)| + |f_\alpha(x^*) - f(x^*)| = O(\nu),$$

where the last order holds from (19). The proof is complete. ∎

*Proof of Theorem 3.3.* We have

$$D_{\mathrm{UKL}}(r, r_\theta; w) = -\int \theta^T h(x)\, w(x)\, p(x)\,dx + \int w(x)\, p(x)\,dx \cdot \log \int \exp(\theta^T h(x))\, w(x)\, q(x)\,dx + \text{const}$$
$$= -(1 - \varepsilon_p) \int \theta^T h(x)\, w(x)\, p^*(x)\,dx - \varepsilon_p \nu_2 + \left[ (1 - \varepsilon_p) \int w(x)\, p^*(x)\,dx + \varepsilon_p \nu_1 \right] \log\left[ (1 - \varepsilon_q) \int \exp(\theta^T h(x))\, w(x)\, q^*(x)\,dx + \varepsilon_q \nu_3 \right]$$
$$= (1 - \varepsilon_p)\left\{ D_{\mathrm{UKL}}(r^*, r_\theta; w) + \text{const} + O(\varepsilon_r \nu) \right\}.$$

Let $f(\theta) = D_{\mathrm{UKL}}(r^*, r_\theta; w)$ and $f_\alpha(\theta) = D_{\mathrm{UKL}}(r, r_\theta; w)/(1 - \varepsilon_p) + \text{const}$ with $\alpha = (\nu_j)_{j=1}^3$. These are $C^2$-class real-valued functions on the compact convex set $\Theta$. The above shows (19). Let the Hessian matrices of $f(\theta)$ and $f_\alpha(\theta)$ be denoted by $J(\theta)$ and $J_\beta(\theta)$ with $\beta = (\nu_j)_{j=1}^5$, respectively. $J_\beta(\theta)$ is continuous in $\beta$. We can easily show that $J(\theta)$ is positive definite after a simple calculation. The conditions $\alpha = O(\nu)$ and $\beta = O(\nu)$ are satisfied by the definition of $\nu$. In a similar manner to the above, we can show (20). The minimizers $\theta^\dagger_{\mathrm{UKL}}$ and $\theta^*_{\mathrm{UKL}}$ are unique interior points in $\Theta$ by the assumption of the theorem. Therefore, all the conditions assumed in Lemma D.1 are satisfied, and we have $\theta^\dagger_{\mathrm{UKL}} - \theta^*_{\mathrm{UKL}} = O(\varepsilon_r \nu)$. ∎

### D.2. Proof of Theorem 3.5

*Proof.* It follows that

$$d_\gamma(r, r_\theta; wq) = -\frac{1}{\gamma} \log \int \exp(\gamma\, \theta^T h(x))\, w(x)\, p(x)\,dx + \frac{1}{1 + \gamma} \log \int \exp((1 + \gamma)\, \theta^T h(x))\, w(x)\, q(x)\,dx$$
$$= -\frac{1}{\gamma} \log\left[ (1 - \varepsilon_p) \int \exp(\gamma\, \theta^T h(x))\, w(x)\, p^*(x)\,dx + \varepsilon_p \nu'_1 \right] + \frac{1}{1 + \gamma} \log\left[ (1 - \varepsilon_q) \int \exp((1 + \gamma)\, \theta^T h(x))\, w(x)\, q^*(x)\,dx + \varepsilon_q \nu'_2 \right]$$
$$= d_\gamma(r^*, r_\theta; wq^*) + \text{const} + O(\varepsilon'_r \nu').$$

Let $f(\theta) = d_\gamma(r^*, r_\theta; wq^*)$ and $f_\alpha(\theta) = d_\gamma(r, r_\theta; wq)$ with $\alpha = (\nu'_j)_{j=1}^2$. These are $C^2$-class real-valued functions on the compact convex set $\Theta$. The above shows (19). Let the Hessian matrices of $f(\theta)$ and $f_\alpha(\theta)$ be denoted by $J(\theta)$ and $J_\beta(\theta)$ with $\beta = (\nu'_j)_{j=1}^6$, respectively. $J_\beta(\theta)$ is continuous in $\beta$. The conditions $\alpha = O(\nu')$ and $\beta = O(\nu')$ are satisfied by the definition of $\nu'$. In a similar manner to the above, we can show (20). The minimizers $\theta^\dagger_\gamma$ and $\theta^*_\gamma$ are unique interior points in $\Theta$ by the assumption of the theorem. If $J(\theta)$ were positive definite over $\Theta$, then all the conditions assumed in Lemma D.1 would be satisfied and we would have $\theta^\dagger_\gamma - \theta^*_\gamma = O(\varepsilon'_r \nu')$. However, from the assumption of the theorem, we only have that $J(\theta)$ is positive definite at $\theta = \theta^*_\gamma$, not over all of $\Theta$. We need to bridge this gap. From the continuity of $J(\theta)$ and (19), we can take a restricted compact convex set $\Theta' \subset \Theta$ such that $J(\theta)$ is positive definite over $\Theta'$ and $\theta^\dagger_\gamma, \theta^*_\gamma \in \Theta'$. Applying Lemma D.1 on the restricted set $\Theta'$, we have $\theta^\dagger_\gamma - \theta^*_\gamma = O(\varepsilon'_r \nu')$. ∎
## E. MSE Values in Section 4.1

We calculated the MSE values between the estimated parameters and the ground truth. The experimental settings are described in Section 4.1. The initial parameters of the iterative calculation were randomly generated from a normal distribution for each experiment. The mean and the standard deviation of the squared errors were calculated. Tables 3, 4, 5, and 6 show the MSE values in the "clean", "reference contamination", "target contamination", and "double contamination" settings, respectively.

Table 3 shows that Weighted DRE and γ-DRE had MSE values comparable to those of DRE and Trimmed DRE in the range $-1.2 \le \theta_{1,2} \le 1.2$. At $\theta_{1,2} = -1.6$ and $\theta_{1,2} = 1.6$, Weighted DRE and γ-DRE had significantly smaller MSE values than DRE and Trimmed DRE. This seems to be because Weighted DRE and γ-DRE eliminated weak outliers. The tables also show that γ-DRE estimated the ground truth slightly better than Weighted DRE in all the settings; γ-DRE need not estimate the normalizing term, so it could attain lower MSE values.

**Table 3.** MSE of the estimated parameters in the "clean" setting (standard deviations in parentheses).

| $\theta_{1,2}$ | $\lambda_p$ | $\lambda_q$ | DRE | Trimmed DRE | Weighted DRE | γ-DRE |
|---|---|---|---|---|---|---|
| -1.6 | 0.8 | -0.8 | 1.0746 (1.5161) | 1.2785 (3.0056) | 0.7038 (0.3516) | 0.3288 (0.5279) |
| -1.2 | 0.4 | -0.8 | 0.4389 (0.7856) | 0.4844 (0.9658) | 0.4534 (0.2723) | 0.2797 (0.1731) |
| -0.8 | 0.4 | -0.4 | 0.1523 (0.1136) | 0.1419 (0.0974) | 0.2377 (0.1564) | 0.1373 (0.0755) |
| -0.4 | 0.4 | 0.0 | 0.0435 (0.0400) | 0.0438 (0.0417) | 0.0834 (0.0883) | 0.0513 (0.0400) |
| 0.0 | 0.0 | 0.0 | 0.0123 (0.0164) | 0.0196 (0.0393) | 0.0240 (0.0491) | 0.0097 (0.0185) |
| 0.4 | 0.0 | 0.4 | 0.0565 (0.0674) | 0.0551 (0.0548) | 0.0883 (0.0765) | 0.0493 (0.0400) |
| 0.8 | -0.4 | 0.4 | 0.1535 (0.1034) | 0.1428 (0.0966) | 0.2349 (0.1351) | 0.1387 (0.0850) |
| 1.2 | -0.8 | 0.4 | 0.3035 (0.6157) | 0.2363 (0.2536) | 0.5086 (0.2216) | 0.2631 (0.1855) |
| 1.6 | -0.8 | 0.8 | 1.5608 (2.7587) | 1.4278 (2.6771) | 0.7467 (0.3520) | 0.2931 (0.2971) |

**Table 4.** MSE of the estimated parameters in the "reference contamination" setting (standard deviations in parentheses).

| $\theta_{1,2}$ | $\lambda_p$ | $\lambda_q$ | DRE | Trimmed DRE | Weighted DRE | γ-DRE |
|---|---|---|---|---|---|---|
| -1.6 | 0.8 | -0.8 | 3950298.5757 (22926.6509) | 35323.5783 (5264.2799) | 0.9017 (0.3773) | 0.3218 (0.3032) |
| -1.2 | 0.4 | -0.8 | 3950542.0491 (25067.6394) | 34954.4141 (4712.6414) | 0.5781 (0.2530) | 0.2431 (0.1692) |
| -0.8 | 0.4 | -0.4 | 3958872.0875 (23063.6142) | 34460.0170 (7649.7992) | 0.2463 (0.1283) | 0.1531 (0.0870) |
| -0.4 | 0.4 | 0.0 | 3956367.0423 (24930.7475) | 25795.7112 (16823.4622) | 0.0902 (0.0926) | 0.0436 (0.0340) |
| 0.0 | 0.0 | 0.0 | 3959068.1508 (25197.0345) | 21001.3299 (19309.3519) | 0.0258 (0.0527) | 0.0079 (0.0115) |
| 0.4 | 0.0 | 0.4 | 3960324.3266 (24236.2236) | 19655.8422 (19797.1869) | 0.1094 (0.1430) | 0.0548 (0.0427) |
| 0.8 | -0.4 | 0.4 | 3957741.4148 (23384.0926) | 20133.8894 (19859.9808) | 0.2834 (0.1815) | 0.1421 (0.0773) |
| 1.2 | -0.8 | 0.4 | 3961153.3083 (25523.0181) | 21343.8024 (20202.3100) | 0.5734 (0.2484) | 0.2712 (0.1711) |
| 1.6 | -0.8 | 0.8 | 3968357.1323 (24595.6865) | 21251.8250 (20522.5385) | 0.9429 (0.3470) | 0.2951 (0.5538) |
**Table 5.** MSE of the estimated parameters in the "target contamination" setting (standard deviations in parentheses).

| $\theta_{1,2}$ | $\lambda_p$ | $\lambda_q$ | DRE | Trimmed DRE | Weighted DRE | γ-DRE |
|---|---|---|---|---|---|---|
| -1.6 | 0.8 | -0.8 | 549087.3957 (499298.7574) | 351177.4384 (319271.9316) | 0.7067 (0.3616) | 0.2708 (0.2885) |
| -1.2 | 0.4 | -0.8 | 637923.7106 (480891.8776) | 357119.1806 (318166.7869) | 0.4389 (0.2751) | 0.2971 (0.1992) |
| -0.8 | 0.4 | -0.4 | 507397.3182 (500213.5196) | 363576.7793 (317394.2769) | 0.2191 (0.1321) | 0.1538 (0.0952) |
| -0.4 | 0.4 | 0.0 | 499076.9452 (501614.2931) | 306914.5394 (320727.8412) | 0.0790 (0.0667) | 0.0436 (0.0331) |
| 0.0 | 0.0 | 0.0 | 585877.3689 (481541.5629) | 327077.0420 (312615.8724) | 0.0186 (0.0274) | 0.0122 (0.0184) |
| 0.4 | 0.0 | 0.4 | 564275.2849 (462204.3313) | 392242.2477 (307921.5425) | 0.0829 (0.0748) | 0.0607 (0.0468) |
| 0.8 | -0.4 | 0.4 | 650588.5428 (358591.2366) | 430804.4411 (250226.4449) | 0.2494 (0.1304) | 0.1332 (0.0845) |
| 1.2 | -0.8 | 0.4 | 870574.0795 (138590.6119) | 539042.4358 (108722.4856) | 0.5089 (0.2182) | 0.2447 (0.1827) |
| 1.6 | -0.8 | 0.8 | 891624.2286 (141870.1246) | 563625.7989 (96903.0755) | 0.7265 (0.3316) | 0.2620 (0.2594) |

**Table 6.** MSE of the estimated parameters in the "double contamination" setting (standard deviations in parentheses).

| $\theta_{1,2}$ | $\lambda_p$ | $\lambda_q$ | DRE | Trimmed DRE | Weighted DRE | γ-DRE |
|---|---|---|---|---|---|---|
| -1.6 | 0.8 | -0.8 | 4.9189 (5.4462) | 3.5911 (3.7100) | 0.9394 (0.3565) | 0.2962 (0.4160) |
| -1.2 | 0.4 | -0.8 | 47.3175 (443.5423) | 3.2073 (3.6291) | 0.5737 (0.2540) | 0.2325 (0.1936) |
| -0.8 | 0.4 | -0.4 | 4.3060 (4.4582) | 4.4132 (4.3617) | 0.2862 (0.1567) | 0.1344 (0.0937) |
| -0.4 | 0.4 | 0.0 | 5.1599 (5.0391) | 4.5406 (4.6529) | 0.0962 (0.1021) | 0.0463 (0.0391) |
| 0.0 | 0.0 | 0.0 | 5.3757 (5.5380) | 5.5567 (4.8591) | 0.0299 (0.0455) | 0.0131 (0.0165) |
| 0.4 | 0.0 | 0.4 | 5.5245 (5.5945) | 6.0835 (6.4211) | 0.0840 (0.0845) | 0.0689 (0.0704) |
| 0.8 | -0.4 | 0.4 | 8.3192 (7.8744) | 9.0380 (7.0335) | 0.2950 (0.1739) | 0.1629 (0.0955) |
| 1.2 | -0.8 | 0.4 | 8.2561 (7.7996) | 10.2449 (8.2807) | 0.5750 (0.2867) | 0.2175 (0.1636) |
| 1.6 | -0.8 | 0.8 | 8.4907 (8.8919) | 9.3233 (8.7860) | 0.9466 (0.5066) | 0.3042 (0.2891) |

## F. Comparison between Stable and Robust DRE

Our problem setting differs from the problem setting previously discussed in Yamada et al. (2011), which we call stable estimation. Stable estimation tackles the situation where the two density functions have no common support, which is the usual case in the high-dimensional setting (Rhodes et al., 2020; Kato & Teshima, 2021; Choi et al., 2021; 2022; Srivastava et al., 2023). Robust estimation aims to eliminate the disturbing effects of outliers. For example, when considering the density ratio of mixtures of main and outlier distributions, stable DRE estimates the density ratio of the mixtures, whereas robust DRE estimates only that of the main distributions.

We compare the performance of RuLSIF (Relative unconstrained Least-Squares Importance Fitting) (Yamada et al., 2011), one of the well-known stable non-parametric DRE methods, and Weighted DRE in the case where different outliers contaminate the main distribution. We prepare three cases:

- "Uncontaminated": $p(x) = q(x) = N(0, 1)$.
- "Same outliers": $p(x) = q(x) = 0.8\, N(0, 1) + 0.2\, N(15, 1)$.
- "Different outliers": $p(x) = 0.8\, N(0, 1) + 0.2\, N(15, 1)$ and $q(x) = 0.8\, N(0, 1) + 0.2\, N(10, 1)$.

In all the cases, the ground truth of the density ratio of the main distributions is $r(x) = 1$. The dataset sizes are $n_p = n_q = 100$. We used a Python implementation of RuLSIF provided on GitHub¹. The hyperparameter $\alpha$ in RuLSIF is set to $0.95$, achieving high stability. Figure 4 shows that RuLSIF fails to estimate the ground truth in the "Different outliers" setting, although Weighted DRE estimates it correctly.
In the Different outliers setting, Ru LSIF falsely detects the change of the outlier distributions, that is, N(15, 1) in p(x) and N(10, 1) in q(x). Because this misspecification affects the estimation of the main distributions N(0, 1), the estimated density ratio is different from the ground truth r(x) = 1. 1https://github.com/hoxo-m/densratio_py/ Density Ratio Estimation with Doubly Strong Robustness Uncontaminated Same outliers Different outliers 5 0 5 10 15 20 Weighted DRE 5 0 5 10 15 20 5 0 5 10 15 20 Figure 4. Comparison of the performance between Ru LSIF and Weighted DRE. The x-axis shows the data values of x, and the y-axis shows the values of the density ratio or the density functions. The blue lines show the estimated density ratio values, and the gray dashed lines show the density functions. For all figures, the true density ratio is r(x) = 1.