# Meta-Learning for Relative Density-Ratio Estimation

Atsutoshi Kumagai (NTT Computer and Data Science Laboratories, atsutoshi.kumagai.ht@hco.ntt.co.jp)
Tomoharu Iwata (NTT Communication Science Laboratories, tomoharu.iwata.gy@hco.ntt.co.jp)
Yasuhiro Fujiwara (NTT Communication Science Laboratories, yasuhiro.fujiwara.kh@hco.ntt.co.jp)

Abstract

The ratio of two probability densities, called a density-ratio, is a vital quantity in machine learning. In particular, a relative density-ratio, which is a bounded extension of the density-ratio, has received much attention due to its stability and has been used in various applications such as outlier detection and dataset comparison. Existing methods for (relative) density-ratio estimation (DRE) require many instances from both densities. However, sufficient instances are often unavailable in practice. In this paper, we propose a meta-learning method for relative DRE, which estimates the relative density-ratio from a few instances by using knowledge in related datasets. Specifically, given two datasets that consist of a few instances, our model extracts the dataset information by using neural networks and uses it to obtain instance embeddings appropriate for relative DRE. We model the relative density-ratio by a linear model on the embedded space, whose global optimum can be obtained as a closed-form solution. The closed-form solution enables fast and effective adaptation to a few instances, and its differentiability enables us to train our model such that the expected test error for relative DRE is explicitly minimized after adapting to a few instances. We empirically demonstrate the effectiveness of the proposed method on three problems: relative DRE, dataset comparison, and outlier detection.

1 Introduction

The ratio of two probability densities, called a density-ratio, has been used in various applications such as outlier detection [14, 1], dataset comparison [49], covariate shift adaptation [42], change point detection [30], positive and unlabeled (PU) learning [19, 18], density estimation [46], and generative adversarial networks [47]. Thus, density-ratio estimation (DRE) is attracting a lot of attention. A naive approach to DRE is to estimate each density separately and then take the ratio. However, this approach does not work well since density estimation is itself a hard problem [48]. Therefore, direct DRE, which bypasses density estimation, has been extensively studied [44, 17, 35, 13]. Although direct DRE is useful, its fundamental weakness is that the density-ratio is unbounded, i.e., it can take arbitrarily large values, which causes stability issues [29]. To cope with this problem, the relative density-ratio has been proposed as a smoothed and bounded extension of the density-ratio [49]. In the above applications, the density-ratio can be replaced with the relative density-ratio, and relative DRE has shown excellent performance [49, 30, 6, 37, 47].

Existing methods for (relative) DRE require many instances from both densities. However, sufficient instances are often unavailable for various reasons. For example, it is difficult to instantly collect many instances from new data sources such as new users or new systems. Collecting instances is expensive in some applications such as clinical trials or crash tests, where DRE can be used for dataset comparison to investigate the effect of drugs/car conditions.
In such cases, existing methods cannot work well.

Figure 1: Our problem formulation. In a training phase, our model is trained with source datasets. In a test phase, the learned model estimates the relative density-ratio $r_\alpha(x) = \frac{p_A(x)}{\alpha p_A(x) + (1-\alpha) p_B(x)}$ $(0 \leq \alpha < 1)$ with target datasets A and B that are generated from densities $p_A(x)$ and $p_B(x)$, respectively.

In this paper, we propose a meta-learning method for relative DRE. To estimate the relative density-ratio from a few instances in target datasets, the proposed method uses instances in different but related datasets, called source datasets. When these datasets are related, we can transfer useful knowledge from the source datasets to the target ones [31]. Figure 1 shows our problem formulation. We model the relative density-ratio by using neural networks, which enable accurate DRE thanks to their high expressive capability. Since each dataset has different properties, incorporating them into the model is essential. To achieve this, given two datasets that consist of a few instances, called support instances, our model first calculates a latent vector representation of each dataset. This vector is calculated by permutation-invariant neural networks that can take a set of instances as input [50]. Since the vector is obtained from a set of instances in the dataset, it contains information on the dataset. With the two latent vectors of the datasets, each instance is non-linearly mapped to an embedding space that is suitable for relative DRE on the datasets. Using the embedded instances, we perform relative DRE, where the relative density-ratio is represented by a linear model on the embedded space. With the squared loss, the global optimum of the linear model can be obtained as a closed-form solution, which enables more stable and faster adaptation to support instances than numerical optimization. The neural networks of our model are trained by minimizing the expected test squared error of relative DRE after adapting to support instances, which is calculated using instances in the source datasets. Since the closed-form solution of the linear model is differentiable, this training can be performed by gradient-based methods such as Adam [20]. Since all parameters of our model are shared across all datasets, which enables knowledge to be shared between them, the learned model can be applied to unseen target datasets. This training explicitly improves the relative DRE performance for test instances after estimating the relative density-ratio using support instances. Thus, the learned model can accurately estimate the relative density-ratio from a few instances.

Our main contributions are as follows: (1) To the best of our knowledge, our work is the first attempt at meta-learning for (relative) DRE. (2) We propose a model that performs accurate relative DRE from a few instances by effectively adapting both the embeddings and the linear model to the instances. (3) We empirically demonstrate the effectiveness of the proposed method on three popular problems: relative DRE, dataset comparison, and outlier detection.

2 Related Work

Many methods for direct DRE have been proposed, such as classifier-based methods [4, 33], Kullback-Leibler importance estimation [43], kernel mean matching [13], and unconstrained least-squares importance fitting (uLSIF) [17].
Although these methods are useful, they can suffer from the unbounded nature of the density-ratio. That is, an instance that lies in a low-density region of the denominator density may have an extremely large ratio value, and DRE can be dominated by such points, which causes robustness and stability issues [29]. This problem is particularly serious when using flexible density-ratio models such as neural networks because they try to fit extremely large density-ratio values [18]. To cope with the unboundedness, relative uLSIF (RuLSIF) applies uLSIF to the relative density-ratio [49]. Since RuLSIF can obtain the optimal parameters as a closed-form solution, it is computationally efficient and stable, and it performs well in various applications when sufficient instances are available [49, 30, 6, 37, 47]. Since fast and effective adaptation to support instances is essential in meta-learning, as explained at the end of this section, we incorporate RuLSIF in our framework. Although neural network-based DRE methods have been proposed recently [18, 35], they cannot perform well when the number of training instances is quite small due to overfitting. Although the proposed method uses neural networks for the instance embeddings, it can accurately perform relative DRE from a few instances by learning how to perform few-shot relative DRE with related datasets.

DRE is used for transfer learning or covariate shift adaptation [40, 42, 45, 22, 7]. To transfer knowledge in a training dataset to a test dataset, these methods estimate the density-ratio between the training and test densities, which is used for weighting labeled training instances. These methods use only two datasets to estimate the density-ratio. In contrast, the proposed method uses multiple datasets to accumulate transferable knowledge and uses it for relative DRE on two new datasets, which can be applied in various settings including covariate shift adaptation, as described in Section 1.

Meta-learning methods have recently been attracting a lot of attention [8, 41, 9, 34, 3, 16]. These methods use multiple datasets to train a model such that it generalizes well after adapting to support instances. In this framework, fast adaptation to support instances is essential since the result of the adaptation is required to train the model in each iteration of training [3, 34]. Encoder-decoder methods such as neural processes [9, 10] perform quick adaptation by forwarding support instances through neural networks. However, they have difficulty working well for arbitrary datasets since the adaptation is approximated only by the neural networks. Gradient-based methods such as model-agnostic meta-learning (MAML) [8] adapt to support instances by an iterative gradient descent method and are widely used. These methods require higher-order derivatives and must retain the whole optimization path of the iterative adaptation in order to backpropagate through it, which imposes considerable computational and memory burdens [3]. Thus, they must keep the number of iterations small, which prevents effective adaptation. In contrast, the proposed method quickly and effectively adapts to support instances by solving a convex optimization problem, whose global optimum can be quickly obtained as a closed-form solution. Although a few methods adapt to support instances by solving convex optimization problems for fast and effective adaptation [3, 26], they consider classification tasks.
To the best of our knowledge, no meta-learning methods have been designed for DRE, and thus, existing meta-learning methods cannot be applied to our problems.

3 Preliminary

We briefly explain the relative density-ratio. Suppose that instances $\{x_n\}_{n=1}^{N}$ are drawn from a distribution with density $p_{\mathrm{nu}}(x)$ and instances $\{x'_n\}_{n=1}^{N'}$ are drawn from another distribution with density $p_{\mathrm{de}}(x)$. The density-ratio $r(x)$ is defined by $r(x) := \frac{p_{\mathrm{nu}}(x)}{p_{\mathrm{de}}(x)}$, where nu and de indicate the numerator and the denominator. The aim of DRE is to directly estimate $r(x)$ from the instances $\{x_n\}_{n=1}^{N}$ and $\{x'_n\}_{n=1}^{N'}$. However, $r(x)$ is unbounded and thus can take extremely large values when the denominator $p_{\mathrm{de}}(x)$ takes a small value. This causes robustness and stability issues [29]. To deal with this problem, the relative density-ratio has been proposed [49]. For $0 \leq \alpha < 1$, the relative density-ratio $r_\alpha(x)$ is defined by

$$r_\alpha(x) := \frac{p_{\mathrm{nu}}(x)}{\alpha p_{\mathrm{nu}}(x) + (1-\alpha) p_{\mathrm{de}}(x)}.$$

The relative density-ratio is bounded since $r_\alpha(x) \leq \frac{1}{\alpha}$ for any $x$, and $r_\alpha(x)$ is always smoother than $r(x)$. $r_\alpha(x)$ can replace $r(x)$ and is used in various applications [49, 30, 6, 37, 47]. When $\alpha = 0$, $r_\alpha(x)$ reduces to the density-ratio $r(x)$. Thus, the relative density-ratio can be regarded as a smoothed and bounded extension of the density-ratio.

4 Proposed Method

4.1 Problem Formulation

Let $X_d = \{x_{dn}\}_{n=1}^{N_d}$ be the $d$-th dataset and $x_{dn} \in \mathbb{R}^M$ be the $M$-dimensional feature vector of the $n$-th instance in the $d$-th dataset. Instances $\{x_{dn}\}_{n=1}^{N_d}$ are drawn from a distribution with density $p_d$. We assume that the feature dimension $M$ is the same across all datasets, but each distribution can differ. Suppose that $D$ datasets $\mathcal{X} := \{X_d\}_{d=1}^{D}$ are given in the training phase. Our goal is to estimate the relative density-ratio $r_\alpha(x)$ from two target datasets that consist of a few instances, $S_{d_{\mathrm{nu}}} = \{x_{d_{\mathrm{nu}}n}\}_{n=1}^{N_{d_{\mathrm{nu}}}}$ and $S_{d_{\mathrm{de}}} = \{x_{d_{\mathrm{de}}n}\}_{n=1}^{N_{d_{\mathrm{de}}}}$, where $d_{\mathrm{nu}}$ and $d_{\mathrm{de}}$ are not included in $\{1, \ldots, D\}$, that are given in the test phase.

4.2 Model

In this subsection, we use notations $S_{\mathrm{nu}}$ and $S_{\mathrm{de}}$ instead of $S_{d_{\mathrm{nu}}}$ and $S_{d_{\mathrm{de}}}$, respectively, for simplicity. Similarly, we use $p_{\mathrm{nu}}$ and $p_{\mathrm{de}}$ instead of $p_{d_{\mathrm{nu}}}$ and $p_{d_{\mathrm{de}}}$, respectively. We explain our model that estimates the relative density-ratio from $S = S_{\mathrm{nu}} \cup S_{\mathrm{de}}$, called support instances. First, our model calculates a latent representation of each dataset using permutation-invariant neural networks [50]:

$$z_{\mathrm{nu}} := g\Big(\frac{1}{|S_{\mathrm{nu}}|} \sum_{x \in S_{\mathrm{nu}}} f(x)\Big) \in \mathbb{R}^K, \quad z_{\mathrm{de}} := g\Big(\frac{1}{|S_{\mathrm{de}}|} \sum_{x \in S_{\mathrm{de}}} f(x)\Big) \in \mathbb{R}^K, \tag{1}$$

where $f$ and $g$ are feed-forward neural networks. Since summation is permutation-invariant, the neural network in Eq. (1) outputs the same vector even though the order of instances in each dataset varies. Thus, the neural network in Eq. (1) is well defined as a function on set inputs. Since latent vector $z$ is calculated from the set of instances in a dataset, $z$ contains information on the empirical distribution of the instances in the dataset. The proposed method can use any other permutation-invariant function, such as summation [50] or the set transformer [25], to obtain latent vectors of datasets. The proposed method models the relative density-ratio by the following neural network,

$$\hat{r}_\alpha(x; S) := \mathbf{w}^\top h([x, z_{\mathrm{nu}}, z_{\mathrm{de}}]), \tag{2}$$

where $[\cdot, \cdot, \cdot]$ is a concatenation of vectors, $h: \mathbb{R}^{M+2K} \rightarrow \mathbb{R}^T_{>0}$ is a feed-forward neural network, and $\mathbf{w} \in \mathbb{R}^T_{\geq 0}$ are linear weights. The non-negativity of both the outputs of $h$ and $\mathbf{w}$ ensures the non-negativity of the estimated relative density-ratio. $h([x, z_{\mathrm{nu}}, z_{\mathrm{de}}])$ represents the embedding of instance $x$. Since $h([x, z_{\mathrm{nu}}, z_{\mathrm{de}}])$ depends on both $z_{\mathrm{nu}}$ and $z_{\mathrm{de}}$, the embeddings reflect the characteristics of the two datasets.
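To make the model concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(2). The layer sizes follow the settings reported in Section 5.1, but the module structure and the class name `MetaRatioModel` are our illustrative assumptions, not released code.

```python
import torch
import torch.nn as nn

class MetaRatioModel(nn.Module):
    """Sketch of Eqs. (1)-(2): permutation-invariant dataset encoders f, g
    and an instance embedding network h with strictly positive outputs."""

    def __init__(self, M, K=64, T=100, hidden=100):
        super().__init__()
        # f: per-instance encoder (three layers with ReLU, as in Section 5.1)
        self.f = nn.Sequential(
            nn.Linear(M, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        # g: maps the averaged f-outputs to a K-dimensional dataset vector
        self.g = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, K))
        # h: instance embedding; Softplus keeps the T outputs positive
        self.h = nn.Sequential(
            nn.Linear(M + 2 * K, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, T), nn.Softplus())

    def dataset_vector(self, X):
        # Eq. (1): averaging over instances makes the encoder permutation-invariant
        return self.g(self.f(X).mean(dim=0))

    def embed(self, X, z_nu, z_de):
        # Input to Eq. (2): concatenate each instance with both dataset vectors
        z = torch.cat([z_nu, z_de]).expand(X.shape[0], -1)
        return self.h(torch.cat([X, z], dim=1))
```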
Such embeddings are learned by using the source datasets $\mathcal{X}$ so that they lead to accurate DRE on the target datasets, as described in Section 4.3. Linear weights $\mathbf{w}$ are determined so that the following expected squared error $J_\alpha$ between the true relative density-ratio $r_\alpha(x)$ and the estimated relative density-ratio $\hat{r}_\alpha(x; S)$ is minimized:

$$J_\alpha = \frac{1}{2} \mathbb{E}_{q_\alpha(x)}\big[(r_\alpha(x) - \hat{r}_\alpha(x; S))^2\big] = \frac{\alpha}{2} \mathbb{E}_{p_{\mathrm{nu}}(x)}\big[\hat{r}_\alpha(x; S)^2\big] + \frac{1-\alpha}{2} \mathbb{E}_{p_{\mathrm{de}}(x)}\big[\hat{r}_\alpha(x; S)^2\big] - \mathbb{E}_{p_{\mathrm{nu}}(x)}\big[\hat{r}_\alpha(x; S)\big] + \mathrm{Const.}, \tag{3}$$

where $\mathbb{E}$ denotes expectation, $q_\alpha(x) := \alpha p_{\mathrm{nu}}(x) + (1-\alpha) p_{\mathrm{de}}(x)$, and Const. is a constant term that does not depend on our model. By approximating the expectations with support instances $S$ and excluding the non-negativity constraints on $\mathbf{w}$, we obtain the following optimization problem:

$$\mathbf{w}^* := \arg\min_{\mathbf{w} \in \mathbb{R}^T} \frac{1}{2} \mathbf{w}^\top \hat{K} \mathbf{w} - \hat{k}^\top \mathbf{w} + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w}, \tag{4}$$

where $\hat{k} = \frac{1}{|S_{\mathrm{nu}}|} \sum_{x \in S_{\mathrm{nu}}} h([x, z_{\mathrm{nu}}, z_{\mathrm{de}}])$ and $\hat{K} = \frac{\alpha}{|S_{\mathrm{nu}}|} \sum_{x \in S_{\mathrm{nu}}} h([x, z_{\mathrm{nu}}, z_{\mathrm{de}}]) h([x, z_{\mathrm{nu}}, z_{\mathrm{de}}])^\top + \frac{1-\alpha}{|S_{\mathrm{de}}|} \sum_{x \in S_{\mathrm{de}}} h([x, z_{\mathrm{nu}}, z_{\mathrm{de}}]) h([x, z_{\mathrm{nu}}, z_{\mathrm{de}}])^\top$. In Eq. (4), the third term on the r.h.s. is an $\ell_2$-regularizer that prevents over-fitting, and $\lambda > 0$ is a regularization parameter. The global optimum of Eq. (4) can be obtained as the following closed-form solution:

$$\mathbf{w}^* = (\hat{K} + \lambda I)^{-1} \hat{k}, \tag{5}$$

where $I$ is the $T$-dimensional identity matrix. This closed-form solution can be computed efficiently when $T$ is not large. Note that $(\hat{K} + \lambda I)^{-1}$ exists since $\lambda > 0$ makes $(\hat{K} + \lambda I)$ positive-definite. Some learned weights in $\mathbf{w}^*$ can be negative. To compensate for this, following previous studies [17], the solution is modified as $\hat{\mathbf{w}} = \max(0, \mathbf{w}^*)$, where the max operator is applied element-wise. The closed-form solution enables fast and effective adaptation to support instances $S$. Using the learned weights, the relative density-ratio estimated with support instances $S$ is

$$\hat{r}_\alpha(x; S) := \hat{\mathbf{w}}^\top h([x, z_{\mathrm{nu}}, z_{\mathrm{de}}]). \tag{6}$$

4.3 Training

We explain the training procedure for our model. In this subsection, the symbols $S = S_{\mathrm{nu}} \cup S_{\mathrm{de}}$ denote support instances taken from the source datasets. In our model, the parameters to be estimated, $\Theta$, are the parameters of neural networks $f$, $g$, and $h$ and the regularization parameter $\lambda$. We estimate these parameters by minimizing the expected test squared error of relative DRE given support instances, where support instances $S = S_{\mathrm{nu}} \cup S_{\mathrm{de}}$ and test instances $Q = Q_{\mathrm{nu}} \cup Q_{\mathrm{de}}$, called query instances, are randomly generated from the source datasets $\mathcal{X}$:

$$\mathbb{E}_{d, d' \sim \{1, \ldots, D\}}\Big[\mathbb{E}_{(S_{\mathrm{nu}}, S_{\mathrm{de}}), (Q_{\mathrm{nu}}, Q_{\mathrm{de}}) \sim X_d \times X_{d'}}\big[J_\alpha(Q; S)\big]\Big], \tag{7}$$

where $(U, V) \sim X_d \times X_{d'}$ denotes that instances $U$ and $V$ are selected from $X_d$ and $X_{d'}$, respectively, and $J_\alpha(Q; S)$ is the approximation of the expected squared error $J_\alpha$ with query instances $Q$:

$$J_\alpha(Q; S) = \frac{\alpha}{2|Q_{\mathrm{nu}}|} \sum_{x \in Q_{\mathrm{nu}}} \hat{r}_\alpha(x; S)^2 + \frac{1-\alpha}{2|Q_{\mathrm{de}}|} \sum_{x \in Q_{\mathrm{de}}} \hat{r}_\alpha(x; S)^2 - \frac{1}{|Q_{\mathrm{nu}}|} \sum_{x \in Q_{\mathrm{nu}}} \hat{r}_\alpha(x; S). \tag{8}$$

Algorithm 1: Training procedure of our model.
Require: source datasets $\mathcal{X}$, support instance size $N_S$, query instance size $N_Q$, relative parameter $\alpha$
Ensure: parameters of our model $\Theta$
1: repeat
2:   Sample two datasets $d$ and $d'$ from $\{1, \ldots, D\}$ with replacement
3:   Select support instances $S_{\mathrm{nu}}$ and $S_{\mathrm{de}}$ of size $N_S$ from $X_d$ and $X_{d'}$, respectively
4:   Select query instances $Q_{\mathrm{nu}}$ and $Q_{\mathrm{de}}$ of size $N_Q$ from $X_d$ and $X_{d'}$, respectively
5:   Calculate linear weights $\hat{\mathbf{w}}$ with the support instances by Eq. (5) to obtain Eq. (6)
6:   Calculate the loss $J_\alpha(Q; S)$ in Eq. (8) with the query instances
7:   Update parameters $\Theta$ with the gradients of the loss $J_\alpha(Q; S)$
8: until the end condition is satisfied

The pseudocode for our training procedure is given in Algorithm 1.
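Lines 5-6 of Algorithm 1 are compact enough to write out. Below is a minimal PyTorch sketch of the closed-form adaptation (Eqs. (4)-(5)), the non-negativity correction $\hat{\mathbf{w}} = \max(0, \mathbf{w}^*)$, and the query loss of Eq. (8). It builds on the hypothetical `MetaRatioModel` sketched in Section 4.2; the default value of `lam` is an illustrative placeholder, since in our method $\lambda$ is itself a learned parameter.

```python
import torch

def adapt_and_loss(model, S_nu, S_de, Q_nu, Q_de, alpha=0.5, lam=0.1):
    """One episode: closed-form adaptation on support instances (Eqs. (4)-(5))
    followed by the query loss of Eq. (8). All inputs are (n, M) tensors."""
    # Eq. (1): dataset representations from the support sets
    z_nu = model.dataset_vector(S_nu)
    z_de = model.dataset_vector(S_de)

    # Embeddings of the support instances, shape (n, T)
    H_nu = model.embed(S_nu, z_nu, z_de)
    H_de = model.embed(S_de, z_nu, z_de)

    # k_hat and K_hat from Eq. (4)
    k_hat = H_nu.mean(dim=0)
    K_hat = (alpha * H_nu.T @ H_nu / H_nu.shape[0]
             + (1 - alpha) * H_de.T @ H_de / H_de.shape[0])

    # Eq. (5): closed-form solution, then clip negative weights (w_hat)
    T = K_hat.shape[0]
    w = torch.linalg.solve(K_hat + lam * torch.eye(T), k_hat)
    w = torch.clamp(w, min=0.0)

    # Eq. (6) on query instances, then the query loss of Eq. (8)
    r_nu = model.embed(Q_nu, z_nu, z_de) @ w
    r_de = model.embed(Q_de, z_nu, z_de) @ w
    return ((alpha / 2) * (r_nu ** 2).mean()
            + ((1 - alpha) / 2) * (r_de ** 2).mean()
            - r_nu.mean())
```

Because `torch.linalg.solve` is differentiable, gradients of the query loss flow through the adaptation into $f$, $g$, $h$, and $\lambda$, which is exactly what Line 7 of Algorithm 1 exploits.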
For each iteration, we randomly select two datasets with replacement (Line 2). From these datasets, we randomly select support instances $S = S_{\mathrm{nu}} \cup S_{\mathrm{de}}$ and query instances $Q = Q_{\mathrm{nu}} \cup Q_{\mathrm{de}}$ (Lines 3-4). We then calculate the relative density-ratio with the support instances (Line 5). Using the estimated relative density-ratio, we calculate the loss $J_\alpha(Q; S)$ with the query instances (Line 6). Lastly, the parameters of our model are updated with the gradient of the loss (Line 7). This procedure trains the parameters of our model so as to explicitly improve the relative DRE performance after estimating the relative density-ratio with a few instances. Thus, the learned model performs accurate DRE from target support instances. Since the closed-form solution for adaptation in Eq. (5) is differentiable w.r.t. the model parameters, this training can be performed by gradient-based methods such as Adam [20]. Although we use the squared error as the objective function on query instances in Eq. (8), our framework can use any differentiable loss function for query instances, such as the Kullback-Leibler divergence [43], since we do not require closed-form solutions for query instances. A more intuitive explanation of the proposed method is given in the supplemental material.

5 Experiments

In this section, we demonstrate the effectiveness of the proposed method on three problems: relative DRE, dataset comparison, and inlier-based outlier detection. All experiments were conducted on a Linux server with an Intel Xeon CPU and an NVIDIA GeForce GTX 1080 GPU.

5.1 Proposed Method Settings

For all problems, a three-layered (two-layered) feed-forward neural network was used for $f$ ($g$) in Eq. (1). For $f$, the number of output and hidden nodes was 100, and ReLU activation was used. For $h$ in Eq. (2), we used a three-layered feed-forward neural network with 100 hidden nodes with ReLU activation and 100 output nodes ($T = 100$) with the Softplus function. Hyperparameters were determined based on the empirical squared error for relative DRE on validation data. The dimension of latent vectors $z$ was chosen from {4, 8, 16, 32, 64, 128, 256}. Relative parameter $\alpha$ was set to 0.5, a value recommended in a previous study [49]. We used the Adam optimizer [20] with a learning rate of 0.001. The mini-batch size was set to 256 (i.e., $N_Q = 128$ for numerator and denominator instances). In training with source datasets, support instances are included in query instances, as in [9, 10]. The squared error on validation data was used for early stopping to avoid over-fitting, where the maximum number of training iterations was 10,000. This setup was used for all neural network-based methods in the subsequent subsections. We implemented all methods in PyTorch [32].

5.2 Relative Density-ratio Estimation

We evaluate the relative DRE performance of the proposed method. We evaluated the squared error in Eq. (3) with $\alpha = 0.5$ on test instances, ignoring the constant term that does not depend on the models.
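For reference, when both densities are known, as with the Gaussian synthetic data described next, the ground-truth $r_\alpha$ follows directly from its definition in Section 3. A small NumPy/SciPy sketch (the Gaussian parameters are arbitrary values within the ranges used in our synthetic setup):

```python
import numpy as np
from scipy.stats import norm

def true_relative_ratio(x, p_nu, p_de, alpha=0.5):
    """Ground-truth relative density-ratio r_alpha(x) from known densities."""
    nu, de = p_nu(x), p_de(x)
    return nu / (alpha * nu + (1 - alpha) * de)

# Two 1-D Gaussians, e.g., N(0.5, 1.0^2) and N(-0.5, 0.5^2)
p_nu = lambda x: norm.pdf(x, loc=0.5, scale=1.0)
p_de = lambda x: norm.pdf(x, loc=-0.5, scale=0.5)

x = np.linspace(-3, 3, 7)
print(true_relative_ratio(x, p_nu, p_de))  # bounded above by 1/alpha = 2
```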
Data: We used one synthetic dataset and two real-world benchmark datasets (Mnist-r and Isolet), which have been commonly used in transfer or multi-task learning studies [11, 27, 23]. In the synthetic data, each dataset $X_d$ was generated from a one-dimensional Gaussian distribution $\mathcal{N}(\mu_d, \sigma_d^2)$. Dataset-specific mean $\mu_d$ and standard deviation $\sigma_d$ were selected uniformly at random from $[-1.5, 1.5]$ and $[0.1, 2]$, respectively. Each dataset consists of 300 instances generated from its distribution. We created 600 source, 3 validation, and 20 target datasets and evaluated the mean test squared error over all target dataset pairs when the number of target support instances was $N_S = 10$. Mnist-r (https://github.com/ghif/mtae), which was derived from MNIST, consists of images. Mnist-r has six tasks, where each task is created by rotating the images in multiples of 15 degrees: 0, 15, 30, 45, 60, and 75. Each task has 1,000 images of 10 classes (digits), represented by 256-dimensional vectors. Isolet (http://archive.ics.uci.edu/ml/datasets/ISOLET) consists of letters spoken by 150 speakers, where the speakers are grouped into five groups (tasks) by speaking similarity. Each instance is represented as a 617-dimensional vector, and the number of classes (letters) is 26. For both benchmark datasets, we treat each class of each task as a dataset; thus, Mnist-r and Isolet have 60 and 130 datasets, respectively. We randomly chose one task and then chose 10 target datasets from the task. From the remaining datasets, we randomly chose 10 validation sets and used the rest as source datasets. We created 10 different splits of source/validation/target datasets and evaluated the mean test squared error over all target dataset pairs.

Comparison methods: We compared the proposed method with RuLSIF [49] and RuLSIF-FT. RuLSIF trains a kernel model for relative DRE with only the target support instances. We used the Gaussian kernel, and the Gaussian width was set to the median distance between support instances, a useful heuristic known as the median trick [39]. Since neural network-based models performed poorly in our experiments due to the small number of instances, we used the kernel model as in the original paper. RuLSIF-FT uses a neural network for modeling the relative density-ratio: it pretrains the model using the source datasets and fine-tunes the weights of the last layer with target support instances. Note that the pretrained model in RuLSIF-FT cannot estimate the relative density-ratio for target datasets without fine-tuning. We used the same network architecture as the proposed method, i.e., a four-layered feed-forward neural network. For RuLSIF and RuLSIF-FT, regularization parameter $\lambda$ was chosen from {0.0001, 0.001, 0.01, 0.1, 1}, and the best test results were reported.

Results: Figure 2 shows five illustrative examples of relative DRE on the synthetic data.

Figure 2: Illustrative examples of relative density-ratio estimation when 10 support instances are used in each target dataset. The horizontal and vertical axes represent $x$ and the relative density-ratio value, respectively. The blue line denotes the true relative density-ratio; the orange, green, and red lines represent the relative density-ratios estimated by Ours, RuLSIF, and RuLSIF-FT, respectively.

The proposed method was able to accurately estimate the relative density-ratio from a few target support instances. The mean test squared errors without the constant term of Ours, RuLSIF, and RuLSIF-FT were -0.613, -0.559, and -0.551, respectively (lower is better). Table 1 shows the mean test squared errors ignoring the constant term with different target support instance sizes on Mnist-r and Isolet.

Table 1: Results for relative DRE: average test squared errors ignoring the constant term with different target support instance sizes. Boldface denotes the best and comparable methods according to the paired t-test (p = 0.05).

| Data    | NS   | Ours   | RuLSIF | RuLSIF-FT |
|---------|------|--------|--------|-----------|
| Mnist-r | 1    | -0.671 | -0.543 | -0.503    |
|         | 2    | -0.748 | -0.581 | -0.513    |
|         | 3    | -0.772 | -0.600 | -0.518    |
|         | 4    | -0.784 | -0.597 | -0.520    |
|         | 5    | -0.793 | -0.578 | -0.520    |
|         | Avg. | -0.754 | -0.580 | -0.515    |
| Isolet  | 1    | -0.873 | -0.656 | -0.508    |
|         | 2    | -0.893 | -0.676 | -0.512    |
|         | 3    | -0.900 | -0.693 | -0.514    |
|         | 4    | -0.903 | -0.695 | -0.514    |
|         | 5    | -0.905 | -0.683 | -0.515    |
|         | Avg. | -0.895 | -0.681 | -0.513    |
The proposed method clearly outperformed RuLSIF and RuLSIF-FT. Since RuLSIF does not use source datasets, it performed worse than the proposed method. Although RuLSIF-FT uses source datasets, it did not work well since it has no mechanism for few-shot relative DRE. In contrast, the proposed method trains the model so that it explicitly improves test relative DRE performance after adapting to a few instances, and thus, it worked well.

Table 3 shows the results of an ablation study of the proposed method. No Latent is our model without the latent vectors for datasets $z$. No Sadapt is our model without adaptation to support instances via the closed-form solution $\hat{\mathbf{w}}$ in Eq. (5); it learns dataset-invariant linear weights $\mathbf{w}$ and uses only the latent vectors of target datasets $z$ to estimate the relative density-ratio for the datasets. Thus, No Sadapt can be categorized as an encoder-decoder meta-learning method. No Sadapt-FT fine-tunes the linear weights $\mathbf{w}$ in the model learned by No Sadapt with target support instances. Note that our model without both the latent vectors and the adaptation to support instances cannot perform relative DRE for target datasets because it cannot take in any information about them. The details of these models are explained in the supplemental material. Although all of these variants performed better than RuLSIF and RuLSIF-FT, the proposed method outperformed them. This result indicates that considering both latent vectors and adaptation to support instances is useful in our framework.

Table 3: Ablation study for relative DRE: average test squared errors without the constant term over different target support instance sizes.

| Data    | Ours   | No Latent | No Sadapt | No Sadapt-FT |
|---------|--------|-----------|-----------|--------------|
| Mnist-r | -0.754 | -0.739    | -0.724    | -0.663       |
| Isolet  | -0.895 | -0.888    | -0.856    | -0.723       |
| Avg.    | -0.825 | -0.814    | -0.790    | -0.693       |

5.3 Dataset Comparison

We evaluate the proposed method on a dataset comparison problem. The aim of this problem is to determine whether two datasets that consist of a few instances come from the same distribution. The proposed method outputs a score for whether two datasets come from the same distribution by calculating the relative Pearson (PE) divergence [49], which is computed from the relative density-ratio.
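Below is a sketch of the plug-in estimator of the relative PE divergence that we assume here, following the estimator form of [49]; `r_nu` and `r_de` are the ratio values $\hat{r}_\alpha$ estimated (after adaptation) on instances from the two datasets being compared:

```python
def relative_pe_divergence(r_nu, r_de, alpha=0.5):
    """Plug-in estimator of the relative PE divergence, following [49].
    r_nu / r_de: estimated ratios r_hat_alpha on instances from the
    numerator and denominator datasets (torch tensors or numpy arrays)."""
    return (-(alpha / 2) * (r_nu ** 2).mean()
            - ((1 - alpha) / 2) * (r_de ** 2).mean()
            + r_nu.mean() - 0.5)
```

A smaller divergence indicates that the two datasets are more likely to come from the same distribution, so the divergence value can be used directly as the comparison score.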
Data: We used Mnist-r and Isolet, described in the previous subsection. We regard two datasets as coming from the same distribution if they are from the same class of the same task. We used all target dataset pairs for evaluation. Since the numbers of same and different pairs in the target datasets are imbalanced (10 same and 90 different pairs), we used the area under the ROC curve (AUC) as the evaluation metric because it can properly evaluate performance in imbalanced classification problems [2].

Comparison methods: We compared the proposed method with six methods: RuLSIF, uLSIF [17], deep direct DRE (D3RE) [18], maximum mean discrepancy (MMD) [12], RuLSIF-FT, and D3RE-FT. uLSIF is a DRE method that is equivalent to RuLSIF with $\alpha = 0$. For RuLSIF, uLSIF, and RuLSIF-FT, the settings are the same as those in Section 5.2. D3RE is a recently proposed neural network-based DRE method. We used the LSIF-based loss function and the same network architecture as the proposed method, i.e., a four-layered feed-forward neural network. D3RE-FT pretrains the model with source datasets and fine-tunes the weights of the last layer with target support instances. For D3RE and D3RE-FT, hyperparameter $C$ was chosen from {0.1, 0.5, 1, 10}, and the best test results were reported. MMD is a non-parametric distribution discrepancy metric, which is widely used since it can compare distributions without density estimation. We used the Gaussian kernel, and the Gaussian width was determined by the median trick. For the proposed method, RuLSIF, and RuLSIF-FT, the relative PE divergence was used as the distribution discrepancy metric; for uLSIF, D3RE, and D3RE-FT, the PE divergence was used. Although the proposed method, RuLSIF-FT, and D3RE-FT use source datasets for training, the others do not. Note that no method uses any information on the similarity/dissimilarity of two datasets during training.

Results: Table 2 shows the mean test AUCs with different target support instance sizes.

Table 2: Results for dataset comparison: average test AUCs [%] with different target support instance sizes. Boldface denotes the best and comparable methods according to the paired t-test (p = 0.05).

| Data    | NS   | Ours  | RuLSIF | uLSIF | D3RE  | MMD   | RuLSIF-FT | D3RE-FT |
|---------|------|-------|--------|-------|-------|-------|-----------|---------|
| Mnist-r | 1    | 83.53 | 47.86  | 55.01 | 64.88 | 45.14 | 69.43     | 63.83   |
|         | 2    | 93.00 | 84.46  | 78.09 | 73.26 | 85.09 | 78.28     | 68.29   |
|         | 3    | 93.86 | 87.83  | 78.81 | 75.30 | 89.18 | 85.66     | 67.96   |
|         | 4    | 96.49 | 93.90  | 83.51 | 80.19 | 93.08 | 87.31     | 71.79   |
|         | 5    | 97.54 | 98.23  | 90.66 | 82.96 | 98.01 | 86.09     | 78.62   |
|         | Avg. | 92.88 | 82.46  | 77.22 | 75.32 | 82.10 | 81.36     | 70.10   |
| Isolet  | 1    | 96.28 | 50.18  | 59.38 | 81.48 | 48.11 | 79.50     | 81.50   |
|         | 2    | 98.32 | 94.28  | 89.70 | 88.20 | 94.57 | 81.62     | 87.76   |
|         | 3    | 99.23 | 97.22  | 94.27 | 89.69 | 97.79 | 85.93     | 88.39   |
|         | 4    | 99.37 | 99.01  | 96.54 | 91.80 | 98.96 | 83.57     | 91.12   |
|         | 5    | 99.60 | 99.61  | 96.77 | 93.86 | 99.43 | 83.19     | 92.74   |
|         | Avg. | 98.56 | 88.07  | 87.33 | 89.00 | 87.77 | 82.76     | 88.30   |

The proposed method showed the best or comparable results in all cases. RuLSIF performed better than uLSIF since relative DRE is more stable than DRE with a few instances. D3RE performed worse than the proposed method since the target support instances were too few to train its neural network. When the support instance size was small, the proposed method outperformed the others by a large margin. This is because it is difficult for RuLSIF, uLSIF, D3RE, and MMD to compare two datasets from only a few target instances. In contrast, the proposed method was able to accurately compare two datasets from a few instances because it learns to perform accurate relative DRE with a few instances. Since RuLSIF-FT and D3RE-FT were not trained for few-shot DRE, they did not perform well. Table 4 shows the results of the ablation study. As in Section 5.2, the proposed method performed better than the ablated variants by considering both latent vectors and adaptation to support instances.

Table 4: Ablation study for dataset comparison: average test AUCs [%] over different target support instance sizes.

| Data    | Ours  | No Latent | No Sadapt | No Sadapt-FT |
|---------|-------|-----------|-----------|--------------|
| Mnist-r | 92.88 | 91.66     | 88.46     | 89.92        |
| Isolet  | 98.56 | 98.31     | 94.73     | 87.49        |
| Avg.    | 95.72 | 94.99     | 91.60     | 88.71        |

5.4 Inlier-based Outlier Detection

We evaluate the proposed method on an inlier-based outlier detection problem. This problem is to find outlier instances in an unlabeled dataset on the basis of another dataset that consists of normal instances. By defining the density-ratio so that the numerator and denominator densities are the normal and unlabeled densities, respectively, we can see that the density-ratio values for outliers are close to zero, since outliers lie in a region where the normal density is low and the unlabeled density is high. Thus, we can use the negative density-ratio value as an outlier score [1, 14]. Similarly, the relative density-ratio values of outliers are close to zero, and thus, we can also use them as outlier scores [49].
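In code, the outlier score is simply the negated output of the adapted estimator. A sketch reusing the hypothetical model from Section 4, where `w_hat` is obtained by Eq. (5) from the target normal/unlabeled support sets:

```python
def outlier_scores(model, w_hat, S_nor, S_un, X_test):
    """Negative estimated relative density-ratio as the outlier score."""
    z_nor = model.dataset_vector(S_nor)  # numerator: normal density
    z_un = model.dataset_vector(S_un)    # denominator: unlabeled density
    r_hat = model.embed(X_test, z_nor, z_un) @ w_hat  # Eq. (6)
    # Outliers get scores near zero; inliers get strongly negative scores,
    # so ranking test instances by this score puts outliers first.
    return -r_hat
```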
In this problem, each dataset consists of normal and unlabeled instances, $X_d = X^{\mathrm{nor}}_d \cup X^{\mathrm{un}}_d$. Accordingly, we use a slightly modified sampling procedure in Algorithm 1: for each iteration, we sample one dataset from the source datasets and create both support and query instances from that dataset (see the sketch at the end of this subsection). The details of the algorithm are described in the supplemental material. We assume that the number of target normal support instances $N^{\mathrm{nor}}_S$ is small, since labeling cost is often high in practice, e.g., in normal-behavior-based outlier detection systems for new users [24].

Data: We used three real-world benchmark datasets: IoT (https://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT), Landmine (http://people.ee.duke.edu/~lcarin/LandmineData.zip), and School (http://multilevel.ioe.ac.uk/intro/datasets.html). These benchmarks are commonly used in outlier detection studies [23, 15]. IoT is real network traffic data gathered from nine IoT devices (datasets) infected by malware. Landmine consists of 29 datasets, where each instance is extracted from a radar image that captures a region of a minefield. School contains the examination scores of students from 139 schools (datasets); we picked schools with 100 or more students, ending up with 74 datasets. The average outlier rates within a dataset of IoT, Landmine, and School are 0.05, 0.06, and 0.15, respectively. The details of the benchmark datasets are described in the supplemental material. For IoT, we randomly chose one target, one validation, and seven source datasets. For Landmine, we randomly chose 3 target, 3 validation, and 23 source datasets. For School, we randomly chose 10 target, 10 validation, and 54 source datasets. For each source/validation dataset in IoT, Landmine, and School, we chose 200, 200, and 50 instances, respectively, as normal instances and the rest as unlabeled instances. For each target dataset, we used all instances except the target normal support instances as unlabeled instances (test instances). For each benchmark, we randomly created 10 different splits of target/validation/source datasets and evaluated the mean test AUC on the target datasets.

Table 5: Results for inlier-based outlier detection: average test AUCs [%] with different target normal support instance sizes $N^{\mathrm{nor}}_S$ (second column). Boldface denotes the best and comparable methods according to the paired t-test (p = 0.05).

| Data     | N^nor_S | Ours  | RuLSIF | uLSIF | D3RE  | AE    | SD    | LOF   | IF    | AE-S  | SD-S  | AE-FT | SD-FT | RuLSIF-FT | D3RE-FT |
|----------|---------|-------|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-----------|---------|
| IoT      | 1       | 97.28 | 95.75  | 95.87 | 85.20 | 93.09 | 91.90 | 93.30 | 41.32 | 43.59 | 40.12 | 66.05 | 55.50 | 93.15     | 84.51   |
|          | 2       | 97.81 | 93.96  | 94.09 | 87.48 | 92.41 | 92.20 | 93.29 | 45.84 | 43.60 | 34.63 | 75.21 | 67.65 | 88.53     | 89.57   |
|          | 3       | 96.39 | 94.64  | 95.38 | 84.33 | 89.48 | 90.86 | 93.34 | 40.79 | 43.61 | 36.57 | 76.88 | 72.82 | 89.38     | 88.97   |
|          | 4       | 97.05 | 94.46  | 93.43 | 81.74 | 90.75 | 89.76 | 92.82 | 42.66 | 43.60 | 40.26 | 80.53 | 67.27 | 89.01     | 89.45   |
|          | 5       | 96.17 | 94.43  | 95.29 | 81.49 | 89.01 | 87.49 | 92.87 | 40.14 | 43.65 | 38.67 | 80.25 | 73.56 | 88.96     | 89.12   |
|          | Avg.    | 96.94 | 94.65  | 94.81 | 84.05 | 90.95 | 90.44 | 93.12 | 42.15 | 43.61 | 38.05 | 75.78 | 67.36 | 89.80     | 88.33   |
| Landmine | 1       | 68.70 | 53.69  | 53.84 | 49.54 | 52.91 | 52.60 | 45.09 | 55.80 | 52.79 | 50.36 | 55.49 | 53.95 | 60.86     | 52.15   |
|          | 2       | 64.92 | 55.56  | 55.21 | 52.54 | 50.39 | 53.00 | 45.17 | 56.08 | 52.79 | 51.16 | 52.08 | 52.35 | 62.55     | 55.15   |
|          | 3       | 63.66 | 54.74  | 54.41 | 51.84 | 49.50 | 49.16 | 45.17 | 56.90 | 52.80 | 50.97 | 55.43 | 53.67 | 61.36     | 53.34   |
|          | 4       | 66.24 | 54.94  | 53.98 | 52.21 | 49.65 | 51.74 | 45.12 | 55.73 | 52.80 | 50.08 | 55.40 | 52.40 | 62.04     | 51.90   |
|          | 5       | 63.05 | 55.46  | 53.42 | 53.32 | 50.88 | 51.08 | 45.19 | 56.35 | 52.79 | 50.08 | 54.43 | 54.34 | 62.62     | 54.46   |
|          | Avg.    | 65.31 | 54.88  | 54.17 | 51.89 | 50.67 | 51.52 | 45.15 | 56.17 | 52.80 | 50.53 | 54.37 | 53.34 | 61.89     | 53.40   |
| School   | 1       | 62.98 | 55.00  | 54.99 | 53.05 | 56.27 | 54.63 | 53.94 | 57.44 | 58.32 | 56.36 | 59.07 | 56.26 | 56.26     | 52.46   |
|          | 2       | 62.18 | 56.34  | 54.81 | 53.79 | 57.20 | 56.24 | 53.73 | 57.11 | 58.27 | 56.64 | 59.27 | 58.53 | 56.02     | 53.47   |
|          | 3       | 64.30 | 56.69  | 55.93 | 54.81 | 57.54 | 56.71 | 54.26 | 57.10 | 58.28 | 56.51 | 59.51 | 57.25 | 55.36     | 54.28   |
|          | 4       | 63.70 | 58.15  | 57.42 | 54.78 | 58.46 | 57.63 | 54.10 | 57.09 | 58.25 | 56.15 | 59.70 | 55.66 | 55.60     | 55.61   |
|          | 5       | 64.61 | 57.76  | 57.71 | 54.89 | 58.54 | 57.01 | 54.12 | 56.92 | 58.24 | 56.75 | 59.33 | 56.84 | 56.11     | 56.02   |
|          | Avg.    | 63.55 | 56.79  | 56.17 | 54.26 | 57.60 | 56.45 | 54.03 | 57.13 | 58.27 | 56.48 | 59.38 | 56.91 | 55.87     | 54.37   |
Comparison methods: We compared the proposed method with 13 outlier detection methods: RuLSIF, uLSIF [14], D3RE, local outlier factor (LOF) [5], isolation forest (IF) [28], autoencoder (AE) [38], deep support vector data description (SD) [36], AE-S, SD-S, fine-tuned variants of AE and SD (AE-FT and SD-FT), RuLSIF-FT, and D3RE-FT. LOF and IF use only target unlabeled instances to find outliers. AE and SD use target normal instances for training. AE-S and SD-S use source normal instances for training. AE-FT and SD-FT fine-tune the models trained by AE-S and SD-S, respectively, with target normal instances. Note that although the AE- and SD-based methods can use unlabeled instances as well as normal instances for training, they performed worse in that setting than when trained with only normal instances; thus, we used only normal instances for training them. RuLSIF, uLSIF, and D3RE use target normal and unlabeled instances. RuLSIF-FT and D3RE-FT use source normal and unlabeled instances as well as target normal and unlabeled instances. Note that no method uses any information on outliers for training. The details of the implementation, such as the neural network architectures and hyperparameter candidates, are described in the supplemental material.

Results: Table 5 shows the mean test AUCs with different target normal support instance sizes. The proposed method showed the best or comparable results in all cases. Density-ratio methods such as RuLSIF and uLSIF tended to show better results than the other comparison methods by using information from both target normal and unlabeled instances. The proposed method further improved on RuLSIF and uLSIF by incorporating the mechanism for few-shot relative DRE. Table 6 shows the results of an ablation study. The best variant can differ across benchmarks since each benchmark has different properties. For example, in IoT, all datasets are relatively similar, which is corroborated by the high test AUCs [23]; thus, the dataset-invariant embedding function $h$ in No Latent is sufficient for adaptation there. Nevertheless, the proposed method (Ours) showed the best average AUC over all benchmarks. In the supplemental material, we additionally show that the proposed method outperformed a PU learning method [21].

Table 6: Ablation study for outlier detection: average test AUCs [%] over different target normal support instance sizes.

| Data     | Ours  | No Latent | No Sadapt | No Sadapt-FT |
|----------|-------|-----------|-----------|--------------|
| IoT      | 96.94 | 97.63     | 95.17     | 94.44        |
| Landmine | 65.31 | 63.08     | 68.83     | 65.83        |
| School   | 63.55 | 63.40     | 60.04     | 62.21        |
| Avg.     | 75.27 | 74.70     | 74.49     | 74.16        |
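To make the modified episode sampling referenced at the start of this subsection concrete, here is a sketch of one episode construction; the exact split sizes and further details follow the supplemental material, so treat the proportions below as illustrative assumptions:

```python
import torch

def sample_outlier_episode(X_nor, X_un, n_support, n_query):
    """One meta-training episode for outlier detection: both support and
    query sets come from a single source dataset, with normal instances as
    the numerator and unlabeled instances as the denominator."""
    def subsample(X, n):
        return X[torch.randperm(X.shape[0])[:n]]
    S_nu = subsample(X_nor, n_support)  # normal support (numerator)
    S_de = subsample(X_un, n_support)   # unlabeled support (denominator)
    Q_nu = subsample(X_nor, n_query)
    Q_de = subsample(X_un, n_query)
    return S_nu, S_de, Q_nu, Q_de
```

The returned sets plug directly into the hypothetical `adapt_and_loss` sketched in Section 4.3.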
5.5 Dependency of Relative Parameter α

We investigated the dependency of the proposed method on relative parameter $\alpha > 0$. Relative parameter $\alpha$ determines the upper bound of the relative density-ratio, since $r_\alpha(x) \leq \frac{1}{\alpha}$ for any $x$. Table 7 shows the results with $\alpha = 0.1$, 0.5, and 0.9 on all three problems. The proposed method consistently outperformed RuLSIF over the different $\alpha$ values. This result suggests that the proposed method is relatively robust against the value of the relative parameter. Note that, in relative DRE, comparison across different $\alpha$ values is meaningless since the scale of the evaluation metric differs. Various additional results, such as an investigation of the dependency on the dimension of the latent vectors, are described in the supplemental material.

Table 7: Dependency on relative parameter $\alpha$ in the proposed method. Values for relative DRE are average test squared errors ignoring constant terms over different target support instance sizes and all benchmarks. Values for dataset comparison and outlier detection are average test AUCs [%] over different target support instance sizes and all benchmarks.

| Problem            | Ours α=0.1 | Ours α=0.5 | Ours α=0.9 | RuLSIF α=0.1 | RuLSIF α=0.5 | RuLSIF α=0.9 |
|--------------------|------------|------------|------------|--------------|--------------|--------------|
| relative DRE       | -2.80      | -0.83      | -0.53      | -0.64        | -0.63        | -0.48        |
| dataset comparison | 92.81      | 95.72      | 95.03      | 85.51        | 85.27        | 86.34        |
| outlier detection  | 74.77      | 75.27      | 74.98      | 68.47        | 68.77        | 68.69        |

5.6 Computation Cost

We investigated the computation time of the proposed method. Table 8 shows the computation time of each method for dataset comparison on Mnist-r. We used a computer with a 2.20 GHz CPU. The support instance size in each target dataset was set to five. We omitted D3RE since it requires training a neural network for each target dataset comparison, which is quite time-consuming compared to the others. Although the proposed method took time for training with source datasets, it was able to compare datasets with relative DRE as fast as the other methods.

Table 8: Computation time in seconds for each method on dataset comparison. Ours (train) and RuLSIF-FT (train) denote training time with source datasets. Ours (test), RuLSIF-FT (test), RuLSIF, uLSIF, and MMD denote test time for 100 target dataset comparisons.

| Ours (train) | Ours (test) | RuLSIF | uLSIF | MMD  | RuLSIF-FT (train) | RuLSIF-FT (test) |
|--------------|-------------|--------|-------|------|-------------------|------------------|
| 137.93       | 0.32        | 0.38   | 0.34  | 0.12 | 50.32             | 0.24             |

6 Limitations

The proposed method uses multiple source datasets to improve relative DRE performance on target datasets. However, when source and target datasets are significantly different, there is a risk of degrading performance on the target datasets. This phenomenon is called negative transfer and is a common challenge for transfer and meta-learning methods in general. Developing methods that automatically remove the negative effects of such datasets is an important research direction.

7 Conclusion

In this paper, we proposed a meta-learning method for relative DRE. We empirically showed that the proposed method outperforms various existing methods on three problems: relative DRE, dataset comparison, and outlier detection. As future work, we plan to incorporate other DRE models, such as telescoping DRE [35], into our framework.

We also describe a potential negative societal impact of our work. The proposed method needs access to datasets obtained from multiple sources, like almost all meta-learning methods. When the datasets are provided by different owners, sensitive information in a dataset risks being stolen and abused by malicious actors. To mitigate this risk, we encourage research on meta-learning methods that do not access raw datasets.
References

[1] M. Abe and M. Sugiyama. Anomaly detection by deep direct density ratio estimation. 2019.
[2] M. Bekkar, H. K. Djemaa, and T. A. Alitouche. Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications, 3(10), 2013.
[3] L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
[4] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In ICML, 2007.
[5] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In SIGMOD, 2000.
[6] B. Dong, Y. Gao, S. Chandra, and L. Khan. Multistream classification with relative density ratio estimation. In AAAI, 2019.
[7] T. Fang, N. Lu, G. Niu, and M. Sugiyama. Rethinking importance weighting for deep learning under distribution shift. In NeurIPS, 2020.
[8] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[9] M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. Rezende, and S. A. Eslami. Conditional neural processes. In ICML, 2018.
[10] M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. Eslami, and Y. W. Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.
[11] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In ICCV, 2015.
[12] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
[13] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3(4):5, 2009.
[14] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 26(2):309–336, 2011.
[15] T. Idé, D. T. Phan, and J. Kalagnanam. Multi-task multi-modal models for collective anomaly detection. In ICDM, 2017.
[16] T. Iwata and A. Kumagai. Meta-learning from tasks with heterogeneous attribute spaces. In NeurIPS, 2020.
[17] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. The Journal of Machine Learning Research, 10:1391–1445, 2009.
[18] M. Kato and T. Teshima. Non-negative Bregman divergence minimization for deep direct density ratio estimation. In ICML, 2021.
[19] M. Kato, T. Teshima, and J. Honda. Learning from positive and unlabeled data with a selection bias. In ICLR, 2018.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] R. Kiryo, G. Niu, M. C. d. Plessis, and M. Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
[22] A. Kumagai and T. Iwata. Learning latest classifiers without additional labeled data. In IJCAI, 2017.
[23] A. Kumagai, T. Iwata, and Y. Fujiwara. Transfer anomaly detection by inferring latent domain representations. In NeurIPS, 2019.
[24] T. D. Lane. Machine learning techniques for the computer security domain of anomaly detection. 2002.
[25] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
[26] K. Lee, S. Maji, A. Ravichandran, and S. Soatto. Meta-learning with differentiable convex optimization. In CVPR, 2019.
[27] C. Li, J. Yan, F. Wei, W. Dong, Q. Liu, and H. Zha. Self-paced multi-task learning. In AAAI, 2017.
[28] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In ICDM, 2008.
[29] S. Liu, A. Takeda, T. Suzuki, and K. Fukumizu. Trimmed density ratio estimation. In NeurIPS, 2017.
[30] S. Liu, M. Yamada, N. Collier, and M. Sugiyama. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43:72–83, 2013.
[31] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.
[32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
[33] J. Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619–630, 1998.
[34] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine. Meta-learning with implicit gradients. In NeurIPS, 2019.
[35] B. Rhodes, K. Xu, and M. U. Gutmann. Telescoping density-ratio estimation. In NeurIPS, 2020.
[36] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, R. Vandermeulen, A. Binder, E. Müller, and M. Kloft. Deep one-class classification. In ICML, 2018.
[37] T. Sakai and N. Shimizu. Covariate shift adaptation on learning from positive and unlabeled data. In AAAI, 2019.
[38] M. Sakurada and T. Yairi. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, page 4. ACM, 2014.
[39] B. Schölkopf, A. J. Smola, F. Bach, et al. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[40] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
[41] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.
[42] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.
[43] M. Sugiyama, S. Nakajima, H. Kashima, P. Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NeurIPS, 2007.
[44] M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.
[45] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
[46] H. Takahashi, T. Iwata, Y. Yamanaka, M. Yamada, and S. Yagi. Variational autoencoder with implicit optimal priors. In AAAI, 2019.
[47] M. Uehara, I. Sato, M. Suzuki, K. Nakayama, and Y. Matsuo. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
[48] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[49] M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25(5):1324–1370, 2013.
[50] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In NeurIPS, 2017.