# learning_latest_classifiers_without_additional_labeled_data__d29e1180.pdf

Learning Latest Classiﬁers without Additional Labeled Data

Atsutoshi Kumagai NTT Secure Platform Laboratories kumagai.atsutoshi@lab.ntt.co.jp

Tomoharu Iwata NTT Communication Science Laboratories iwata.tomoharu@lab.ntt.co.jp

In various applications such as spam mail classiﬁcation, the performance of classiﬁers deteriorates over time. Although retraining classiﬁers using labeled data helps to maintain the performance, continuously preparing labeled data is quite expensive. In this paper, we propose a method to learn classiﬁers by using newly obtained unlabeled data, which are easy to prepare, as well as labeled data collected beforehand. A major reason for the performance deterioration is the emergence of new features that do not appear in the training phase. Another major reason is the change of the distribution between the training and test phases. The proposed method learns the latest classiﬁers that overcome both problems. With the proposed method, the conditional distribution of new features given existing features is learned using the unlabeled data. In addition, the proposed method estimates the density ratio between training and test distributions by using the labeled and unlabeled data. We approximate the classiﬁcation error of a classiﬁer, which exploits new features as well as existing features, at the test phase by incorporating both the conditional distribution of new features and the density ratio, simultaneously. By minimizing the approximated error while integrating out new feature values, we obtain a classiﬁer that exploits new features and ﬁts on the test phase. The effectiveness of the proposed method is demonstrated with experiments using synthetic and real-world data sets.

1 Introduction

The performance of classiﬁers deteriorates over time in various applications. For example, classiﬁers for identifying malicious web sites would become inaccurate since malicious web sites are uninterruptedly created to scam users [Ma et al., 2009]. In activity recognition using sensor data, the classiﬁcation error would increase over time since user activity patterns dynamically change [Abdallah et al., 2012]. For tasks where the performance of classiﬁers deteriorates, retraining classiﬁers is required to maintain the performance.

There have been many methods proposed that retrain classiﬁers, such as online learning [Crammer et al., 2009], forgetting algorithms [Klinkenberg, 2004], and ensemble learning [Wang et al., 2003]. These methods require labeled data to retrain classiﬁers. However, it is quite expensive to continuously prepare labeled data since labels need to be manually assigned by domain experts. In contrasts, unlabeled data can be easily collected. In this paper, we propose a method for learning classiﬁers by using newly obtained unlabeled data as well as labeled data collected beforehand. There are two major reasons for the performance deterioration. The ﬁrst is the emergence of new features that do not appear in the training phase. For example, in spam mail classiﬁcation, spammers endlessly create new spam mails that include words (features) related to new products or services to cheat users. Therefore, new features to discriminate spam mails emerge over time. If we do not retrain the classiﬁer, it would become impossible to classify these mails precisely [Fdez-Riverola et al., 2007]. Figure 1 shows the time variation of the cumulative number of features in the real-world spam data sets (ECUE spam in [Gama et al., 2014]). The number of features increases rapidly over time. The other reason is the change of probability distribution by which data are governed. For example, in a brain computer-interface, the probability distribution of EEG data changes over time since EEG patterns are affected by user attention or fatigue [Li et al., 2010]. If training and test data follow different distributions, the classiﬁcation performance would become poor since the classiﬁer is learned in order to accurately classify samples generated from the training data distribution rather than samples from the test data distribution [Gama et al., 2014; Pan and Yang, 2010]. In our situation, the training data correspond to the labeled data that are collected until a certain time point, and the test data correspond to the unlabeled data that are collected after the time point, which we would like to accurately classify. Figure 2 shows distributions of a real-world spam data set for February of 2003 and October of 2003, where samples are visualized in a two-dimensional space by t-distributed stochastic neighbor embedding (t-SNE) [Van der Maaten and Hinton, 2008]. The distribution for October of 2003, where there are many samples around the upper left, is varied from that on February of 2003, which is concentrated around the lower left. The proposed method learns latest classiﬁers without ad-

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

Figure 1: Time variation of cumulative number of features in a real-world spam data set

Figure 2: Distributions of a real-world spam data set for February of 2003 and October of 2003, where samples are visualized in twodimensional space

Figure 3: Observed and missing values of given data. Shaded and unshaded parts represent missing and existing values, respectively. Labeled samples are drawn from p(xo), and unlabeled samples are drawn from p (xo, xh)

ditional labeled data that overcomes both problems, i.e., the emergence of new features and the change of distributions, simultaneously. Newly obtained unlabeled data contain information on the correlation between new features, which do not appear in the labeled data but appear in the unlabeled data, and existing features, which appear in the labeled data as well as in the unlabeled data. The proposed method models the conditional distribution of new features given existing features by a multivariate normal distribution, which is learned using the unlabeled data. The newly obtained unlabeled data also contain information on the distribution change. We consider a situation called covariate shift , where training and test data follow different distributions but the class label distribution given the features is constant since it has been reported that this assumption matches various real-world problems, such as spam ﬁltering [Bickel and Scheffer, 2007], HIV therapy screening [Bickel et al., 2008], and human activity recognition [Hachiya et al., 2012]. Using the labeled and unlabeled data, the proposed method estimates the density ratio between training and test distributions, which models the distribution change. We approximate the classiﬁcation error of a classiﬁer, which exploits new features as well as existing features, at the test phase by incorporating both the conditional distribution of new features and the density ratio between training and test distribution. The approximation is calculated by integrating out new feature values by using newly obtained unlabeled data and labeled data collected beforehand without additional labeled data. By minimizing the upper bound of the approximated error, we obtain a classiﬁer that exploits new features and ﬁts on the test phase.

2 Related Work

The proposed method utilizes imputation techniques for handling the emergence of new features since it can be integrated with the framework for covariate shift in a principled way. The imputation techniques are typically used for obtaining estimates of missing values. For example, a single imputation method estimates a conditional distribution of missing features given the existing features and replaces missing values by the conditional mean [Donders et al., 2006]. Since single imputation methods treat missing values as ﬁxed data, the uncertainty of the estimated miss-

ing values is not considered. Multiple imputation methods take the uncertainty into account and have ﬂourished in the statistics community [Rubin, 2004]. They generate multiple samples from the estimated conditional distribution, and the multiple imputed data are used for learning classiﬁers. Since many samples would be necessary to represent the uncertainty, the computational cost of multiple imputation methods would be high. In comparison, the proposed method takes the uncertainty into account but is efﬁcient since it does not require multiple samples by analytically integrating out missing values using the conditional distribution. Semi-supervised learning methods that use both labeled and unlabeled data for learning classiﬁers have been developed [Nigam et al., 2000; Grandvalet and Bengio, 2004; Kingma et al., 2014]. Since semi-supervised methods learn classiﬁers by assigning pseudo labels to unlabeled data, new features that appear only in the unlabeled data can be exploited to the classiﬁers. However, the existing imputation and semi-supervised methods does not explicitly consider the emergence of new features over time and does not handle the change of the distribution.

Many methods have been developed for covariate shift. [Zhang et al., 2013] proposed to match data distributions in the Hilbert space by aligning kernel matrices across domains. [Liu and Ziebart, 2014] proposed a minmax approach for learning classiﬁers. Recently, many works have converged to the direction of using importance weights for covariate shift [Shimodaira, 2000; Kanamori et al., 2009; Sugiyama et al., 2013]. This method learns classiﬁers by weighting a training sample with its importance, which is deﬁned by a ratio between the data distribution at the training phase and the data distribution at the test phase. Many methods for estimating importance have been proposed, such as moment matching [Huang et al., 2006], density matching under the Kullback-Leibler divergence [Sugiyama et al., 2008], and least-squares importance ﬁtting to the ratio [Kanamori et al., 2009]. All of these methods learn classiﬁers by using the importance weighted training data. This means that features that appear only in the test data but not in the training data cannot be incorporated into the classiﬁers. The proposed method is the ﬁrst attempt to incorporate new features for importance weighting methods in covariate shift adaptation.

Our task is related to transfer learning and concept drift.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

Transfer learning utilizes data in a source domain to solve a related problem in a target domain [Pan and Yang, 2010]. In our task, we regard data generated until a certain time point as source domain, and data generated after the time point as the target domain since these data follow different distributions. Although various assumptions for the distribution change have been studied in transfer learning, in this paper, we focus on a situation of covariate shift since it has been reported that it matches various real-world applications [Bickel and Scheffer, 2007; Bickel et al., 2008; Hachiya et al., 2012]. Concept drift is an online supervised scenario, where a data distribution changes over time, and many methods for concept drift have been proposed [Gama et al., 2014; Klinkenberg, 2004; Wang et al., 2003; Haque et al., 2016]. Although these methods for concept drift generally assume that labeled data are sequentially given to update classiﬁers, the proposed method utilizes unlabeled data to ﬁt on the test distribution, and does not require labeled data to update. Methods for predicting future classiﬁers given only labeled data collected until the current time have been proposed [Kumagai and Iwata, 2016; 2017]. Although these methods are designed to maintain the classiﬁcation performance, they do not cope with the problems of the emergence of new features and the distribution change since they do not use any additional training data, which contain rich information for the problems.

3 Proposed Method

We introduce notations and deﬁne the task studied in this paper. Let D:={(xo n, yn)}N n=1 be a set of labeled samples collected until a certain time point, where xo n RDo is the Do-dimensional feature vector of the n-th labeled sample, yn {0, 1} is its class label, and N is the number of the labeled data. We suppose that the training feature vectors {xo n}N n=1 are drawn from a training distribution p(xo). D :={(xo m, xh m)}M m=1 is a set of unlabeled samples collected after the time point, where (xo m, xh m) is the (Do + Dh)- dimensional feature vector of the m-th unlabeled sample, and M is the number of the unlabeled data. The unlabeled data contains new features xh m RDh, which do not appear in the labeled data D since the unlabeled data are collected after the labeled data are collected. The test feature vectors {(xo m, xh m)}M m=1 are drawn from a test distribution p (xo, xh). Figure 3 summarizes observed and missing values in a given data in our task. We consider a situation of covariate shift where training and test data follow different distributions but the conditional distribution of the class label given the feature vector is constant between training and test phases: p(xo, xh) = p (xo, xh), p(y|xo, xh) = p (y|xo, xh). Our goal is to ﬁnd classiﬁer h : RDo+Dh {0, 1}, which can accurately classify samples drawn from p (xo, xh), given the labeled and unlabeled data D D .

3.1 Our Framework

We obtain a classiﬁer by minimizing the following generalization error G at the test phase,

G := Z loss(y, h(xo, xh))p (xo, xh, y)dxodxhdy, (1)

where loss(y, y ) denotes any loss function that outputs a point-wise error when y is predicted by y . The generalization error G is the expectation of the loss function with the test distribution. This generalization error G can be rearranged as

G= Z loss(y, h(xo, xh))p (xo, xh)

p(xo, xh) p(xo, xh, y)dxodxhdy,

where the assumption, p(y|xo, xh) = p (y|xo, xh), is used. Since the new features do not appear in the labeled training data, it is impossible to obtain the distribution of the new features from the labeled training data. Thus, we assume that the distribution of the new features given the existing features in the training phase is the same with the distribution in the test phase, p(xh|xo) = p (xh|xo) which enables us to learn classiﬁers that exploits new features. Then, the generalization error G is approximated by the following empirical risk,

n τ(xo n) Z loss(yn, h(xo n, xh n))p(xh n|xo n)dxh n,

(3) which is estimated using the labeled training data and the unlabeled test data. Here, τ(xo) := p (xo)

p(xo) is the importance weight at xo. This weight depends only on the existing features but not on the newly emerged features. The empirical risk at the test phase (3) can be regarded as a importanceweighted average of the empirical risk at the training phase. To minimize the empirical risk (3), we must estimate the conditional distribution p(xh n|xo n) and the importance weight τ(xo) at ﬁrst.

3.2 Conditional Distributions of New Features We estimate the distribution of the new features given the existing features p(xh n|xo n) by assuming it is modeled as the following multivariate normal distribution,

p(xh n|xo n) = N xh n Axo n + a, Λ 1 , (4)

where N(µ, Σ) is a multivariate normal distribution with mean µ and covariance matrix Σ, A is a Dh Do matrix for deﬁning transformation from existing features xo to new features xh, a RDh is a bias term, and Λ is a Dh Dh precision matrix. We assume that the matrix A is lowrank and can be decomposed into a product of two matrixes, A = BC, B RDh K, C RK Do. We can decrease the number of parameters by setting K < Dh Do Dh+Do , which enables us to handle high-dimensional feature vectors. In addition, for simplicity, we restrict the precision matrix Λ to a diagonal matrix as Λ = diag(λ)2+ϵIDh, where λ = (λ1, . . . , λDh) RDh is a precision vector, diag(x) returns a diagonal matrix whose diagonal elements are x, ϵ is a positive constant, and Ik is the k k identity matrix. The term ϵIDh is added in order to ensure the positive deﬁniteness of Λ. We estimate the unknown parameters B, C, a, and λ on the basis of the maximum a posteriori (MAP) estimation by using unlabeled data D = {(xo m, xh m)}M m=1. The conditional density p(xh n|xo n) are estimated by using samples D drawn from test distribution p (x) but not using samples drawn from training distribution p(x) since p(xh|xo) = p (xh|xo) and the training

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

data do not contain new features. For the priors of B and C, we assume the normal distributions N(0Dh K, r 1IDh K) and N(0Do K, r 1IDo K), where 0k denotes the k-dimensional zero vector and r R+ is a precision parameter. For the prior of the elements of the precision vector λ2 k, we assume a Gamma distribution Gam(λ2 k|a, b). Then, the log posterior F := log p(B, C, a, λ|D ) is given by

F = log p(D |B, C, a,λ)+log p(B)+log p(C)+log p(λ2)

m (xh m BCxo m a) Λ(xh m BCxo m a)

2Tr(C C) + 1

2log det(Λ)

k 2(a 1)logλk bλ2 k + const, (5)

where Tr is a trace, denotes a transposition, and det is a determinant. The log posterior F is maximized using gradientbased methods over the parameters B, C, a, λ. Although we use a multivariate normal distribution to model p(xh n|xo n) for simplicity, it is possible to use other distributions such as Gaussian mixtures and Bernoulli distributions. In addition, although we modeled the mean of the multivariate normal distribution of xh given xo by linear regression, it is possible to use more powerful non-linear functions such as neural nets.

3.3 Importance Weight We utilize the unconstrained least-square importance ﬁtting approach (u LSIF) [Kanamori et al., 2009] for estimating the importance weight τ(xo). Since the u LSIF was shown to have excellent numerical stability and efﬁcient run-time solution [Sugiyama et al., 2013], we chose the u LSIF as the importance estimating method. The u LSIF models the importance τ(xo) by using the following linear model, ˆτ(xo) = α ψ(xo), where α = (α1, . . . , αL) RL is a parameter vector, L is the number of parameters, and ψ(xo) RL are basis functions. Although the original u LSIF implicitly assumes that the importance weight depends on both existing and new features, the u LSIF utilized in the proposed method depends only on the existing features. We use the Gaussian kernel model centered at the existing features of unlabeled data xo m as basis functions: ˆτ(xo) = PM m=1 αm exp xo xo m 2

2γ2 , where is ℓ2-norm and

γ2 denotes the Gaussian width. The parameter vector α is learned so that the following objective function J(α) is minimized,

J(α) = Z (τ(xo) ˆτ(xo))2 p(xo)dxo + ρ α 2, (6)

which is the expected squared error with ℓ2 regularization. With the empirical approximation, this objective function can be rearranged as

J(α) α Hα 2α h + ρ α 2, (7)

where H = 1 N P n ψ(xo n)ψ(xo n) and h = 1 M P m ψ(xo m). This optimization problem is a convex and its minimizer ˆα can be analytically calculated by ˆα =

N P n ψ(xo n)ψ(xo n) + ρI 1 1

M P m ψ(xo m) . Since elements of the estimated parameter ˆα can be negative, we modify them with α = max(0L, ˆα), where max is applied in the element-wise manner [Kanamori et al., 2009].

3.4 Learning Classiﬁers with Importance Weights While Integrating out New Features We employ the negative log likelihood log p(yn|xo n, xh n) as a loss function in (3) though it is possible to use other loss functions such as the 0/1-loss or hinge-loss. We assume that the conditional probability of class label y given feature vector (xo, xh) is modeled by logistic regression, p(y = 1|xo, xh, w) = σ w (xo, xh) , where w RDo+Dh is a parameter vector of the classiﬁer h, and σ( ) is the sigmoid function. We would like to minimize the empirical risk (3) with respect to the model parameter w. However, the aforementioned integral is intractable because of the nonconjugacy of p(y|xo, xh, w). To deal with this problem, we use the following inequality [Bishop, 2006],

p(yn|xo n, xh n, w)

eynsnσ(ξn)exp sn+ξn

2 g(ξn)(s2 n ξn 2) , (8)

where sn := w (xo n, xh n), ξn R is a parameter that is associated with each labeled training sample and controls the accuracy of the approximation, and g(ξn) = 1 2ξn σ(ξn) 1

2 . By substituting the term on the right side of (8) with the negative log likelihood in the empirical risk (3), we obtain the new objective function G. Since G is the upper bound of the original objective function G, decreasing the value of G leads to an decreased value of G. Using the ﬁrst and second moments of the normal distributions, R xh np(xh n|xo n)dxh n = BCxo n + a (=: µ(xo n)), R xh nxh n p(xh n|xo n)dxh n = µ(xo n)µ(xo n) + Λ 1, we write the objective function G with ℓ2regularizer to be minimized as follows,

n ˆτ(xo n) Z 1

2ξn g(ξn)ξ2 n log σ(ξn) +(1

xo n wo+g(ξn)(xo n wo)2 p(xh n|xo n)

2 yn)xh n wh + 2g(ξn)xo n woxh n wh

+g(ξn)(xh n wh)2 p(xh n|xo n)dxh n + 1

n ˆτ(xo n) 1

2ξn g(ξn)ξ2 n log σ(ξn) + (1

xo n wo + g(ξn)(xo n wo)2 + (1

2 yn)µ(xo n) wh

+2g(ξn)xo n woµ(xo n) wh + g(ξn)[(µ(xo n) wh)2

+w h Λ 1wh] + 1

2c w 2, (9)

where w = (wo, wh) RDo+Dh and c is a positive constant. The objective function G is minimized using gradient-based methods over the parameters w = (wo, wh), {ξn}N n=1.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

Figure 4: Synthetic Data in case of (π1, π2) = (0.15, 0.75)

Figure 5: Average and standard error of AUC varying mixture ratio (π1, π2) in synthetic data

4 Experiments We conducted experiments using one synthetic and three realworld data sets to conﬁrm the effectiveness of the proposed method.

4.1 Synthetic Data Each training sample xo n R is drawn from π1N( 3, 1) + π2N(6, 1) if yn = 1 and N(0, 1) if yn = 0, where πi R+ are mixture ratios satisfying π1 + π2 = 1. Each test sample xm = (xo m, xh m) R2 is drawn from N (( 3, 6), I2) if ym = 1 and N ((0, 2), I2) if ym = 0, where new feature xh m is added in the test distribution. We illustrate this data in Figure 4. In the experiments, we generate 100 labeled training samples for each label y and 100 test samples for each label y.

4.2 Real-world Data We used three real-world data sets: SPAM, URL, and LOG. SPAM is a collection of spam and legitimate email received by one from February 1st of 2003 to January 31th of 2004 [Gama et al., 2014]. This data set is the same as that stated in Figure 1 and 2. The total number of emails is 11, 905, and the number of features is 166, 047. URL is a public data set of malicious and normal urls collected over 120 days [Ma et al., 2009]. The total number of examples is 2, 396, 130, and the number of features 3,231,961. These two data sets are well-used in concept drift literatures. LOG is our in-house data collection of malicious and normal Web Proxy logs. Malicious Web Proxy logs were created on the basis of malware communications. Since malware evolve over time, the malware communications also change over time. Each log includes information such as source IP address, destination IP address and url.

4.3 Setting In our experiments using the synthetic data set, we evaluated AUC, which is a well-used evaluation measure for classiﬁcation tasks, varying the values of the mixture ratio (π1, π2). We created ten different sets and evaluated the average AUC for each mixture ratio (π1, π2). In our experiments using the real-world data sets, we calculated AUC by using samples collected until a certain time point for labeled training data and samples after the certain time point for unlabeled training and test data for evaluating the classiﬁcation performance on latest data. For SPAM, roughly, samples collected in the

n-th month were used for labeled data, samples in the n+1-th month for unlabeled data, and samples in the n + 2-th month for test data, where n is 2, 3, , 11. Therefore, we obtained ten different sets in total, and we evaluated the average AUC over the these sets. For URL, we ﬁrst split 120 days, making a split every 10 days, and samples collected in the n-th unit were used for labeled data, samples in the n + 1-th unit for unlabeled data, and samples in the n + 2-th unit for test data, where n is 1, 2, , 10. For LOG, samples collected in a day were used for labeled data, and samples collected over two days after the day, which labeled data were collected in, were used for unlabeled or test data. We compared the proposed method with four existing methods: logistic regression (LR), imputation logistic regression (ILR), importance weighted logistic regression (IWLR) and semi-supervised logistic regression by minimum entropy regularization (MER). LR learns classiﬁers by using only labeled data D:={(xo n, yn)}N n=1. ILR learns classiﬁers by logistic regression after completing labeled data xo n with the conditional mean in (4). IWLR learns classiﬁers by logistic regression with the importance weighted method (specifically, by u LSIF [Kanamori et al., 2009]). MER is a semisupervised extension of logistic regression [Grandvalet and Bengio, 2004]. MER learns classiﬁers so as to separate unlabeled data as much as possible on the basis of minimum entropy regularization. To evaluate the effectiveness of considering new features and importance weights, we used logistic regression for classiﬁers with all methods including the proposed method. In our experiments, we chose the optimal hyper parameters for these methods from the following variations by using validation data: regularization parameter for classiﬁers c {10 1, 1, 101} in all methods, regularization parameter for importance ρ {10 1, 1, 101} in the proposed method and IWLR, regularization parameter for imputation r, b {10 1, 1, 101}, a {1}, and imputation parameter K {1, 3, 6, 9} in the proposed method and ILR. For the proposed method and IWLR, the Gaussian width of the Gaussian Kernel γ is determined by median trick, that is, γ is set by the median of squared distance between training points (labeled and unlabeled data). The weighting parameter of unlabeled data for MER is chosen in {{0.1 10 n, 0.5 10 n}3 n=0, 1}, The correction term ϵ is set by 10 4 for the proposed method and ILR.

4.4 Results We ﬁrst show the classiﬁcation performance in the synthetic data sets. Figure 5 shows the average and standard error of AUC, varying the values of the mixture ratio (π1, π2). When π2 was 0.25, there was not much difference between each method. This is because the number of training samples with y = 1 located on the left side was much larger than training samples with y = 1 located on the right side, and therefore, the optimal decision boundary, whose normal vector points to the left direction, can be easily learned in every method. When π2 was 0.4 and 0.55, the AUCs of the methods except the proposed method and IWLR became poor since they were affected by a lot of training samples with y = 1 located on the right side. However, the proposed method and IWLR could

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

Table 1: Average and standard error of AUC over N/M = {0.25, 0.5, 1.0}, where N labeled samples and M ﬁxed unlabeled samples. Boldface indicates the best method, which is signiﬁcantly better than the method(s) marked with (in paired t-test, p=0.05)

Proposed LR ILR IWLR MER SPAM 0.8691 (0.0162) 0.8376 (0.0231) 0.8371 (0.0227) 0.8393 (0.0224) 0.8311 (0.0231) URL 0.9845 (0.0016) 0.9838 (0.0017) 0.9839 (0.0017) 0.9839 (0.0017) 0.9841 (0.0017) LOG 0.6489 (0.0111) 0.6399 (0.0111) 0.6482 (0.0109) 0.6403 (0.0109) 0.6398 (0.0111)

Table 2: Average number of existing features Do and new features Dh in the case when #labeled samples equals to #unlabeled samples (N/M = 1.0)

Dataset Do Dh SPAM 14,915 10,983 URL 9,677 6,740 LOG 20,670 15,916

learn a good decision boundary since the importance weights of training samples located on the left side are bigger than those located on the right side. When π2 was 0.7 and 0.85, only the proposed method achieved a high AUC, although IWLR became inaccurate. This is because the importance weights estimated by the proposed method only depend on existing features xo, although the importance weights with IWLR depend on existing and new features (xo, xh) , and therefore, IWLR tends to give large importance weights to training samples with y = 1 on the right side. Overall, the proposed method achieved a good classiﬁcation performance compared with the others in the situation, where there were new features and training and test distributions differ. Next, we investigate the classiﬁcation performance when varying the number of labeled samples for the real-world data sets. Here we set the number of unlabeled and test samples uniformly so that it did not exceed the smallest number of samples in each time unit (here, the time unit is a month in the case of SPAM, ten days in the case of URL, and a day in the case of LOG). As a result, the number of unlabeled samples for SPAM, URL and LOG were 500, 1000 and 600, respectively. The number of test samples was also the same number of unlabeled samples. Table 2 shows the number of existing features Do and new features Dh in the case of N/M = 1.0, where N is the number of labeled samples and M is the number of ﬁxed unlabeled samples, for each realworld data set. Table 1 shows the average and standard error of AUC over N/M = {0.25, 0.5, 1.0}. For all data sets, the proposed method achieved the highest average of AUC over all N/M. In addition, the proposed method outperformed the other methods in many N/M (the proposed method showed the best results 6 out of 9 times). Although ILR and IWLR somewhat improved the average of AUC over all N/M compared with LR, the proposed method improved AUC further. This result suggests that it is better to learn classiﬁers taking into account two problems, that is, the emergence of new features and the distribution change, at the same time more than when they are considered separately. Last, we investigated how the classiﬁcation performance of the proposed method changed against imputation parameter K, which determines the number of parameters to be

Figure 6: Average of AUC over the N/M when varying value of imputation parameter K

estimated for the conditional probability with new features given existing features. The number of labeled and unlabeled samples was same as in the earlier experiments. Figure 6 represents the averages of AUC over N/M = {0.25, 0.5, 1.0} when changing the value of imputation parameter K within {1, 3, 6, 9}. Here, the AUCs of LR, IWLR, and MER were constant when varying the value of K since they do not depend on the value of K. Overall, the proposed method achieved a good classiﬁcation performance for all the values of imputation parameter K. For SPAM and URL, the AUC of the proposed method was better than the other methods for all imputation parameters K. For LOG, the proposed method outperformed the others except in the case of K = 1. Nevertheless, even with K = 1, the proposed method achieved a relatively good classiﬁcation performance. One of the reasons the proposed method with K = 1 achieved a good classiﬁcation performance is that only a small part of the new features affected the classiﬁcation performance. That is, there is a possibility that only a small K is enough to improve the classiﬁcation performance even though it is not sufﬁcient to represent a correlation of new and existing features perfectly. As a result, we assume that the proposed method requires only a small K to classify data correctly.

5 Conclusion

We proposed a novel method for learning the latest classiﬁers by using labeled data collected beforehand and newly obtained unlabeled data. In experiments, we conﬁrmed that the proposed method outperformed the various existing methods in the situation , where there were new features and training and test distributions differ. As future work, we will extend the proposed method to on-line learning in order to be able to apply it to large-scale data. In addition, applying deep learning methods for modeling conditional distribution of new features is also interesting.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

[Abdallah et al., 2012] Zahraa Said Abdallah, Mohamed Medhat Gaber, Bama Srinivasan, and Shonali Krishnaswamy. Streamar: incremental and active learning with evolving sensory data for activity recognition. In ICTAI, 2012. [Bickel and Scheffer, 2007] Steffen Bickel and Tobias Scheffer. Dirichlet-enhanced spam ﬁltering based on biased samples. NIPS, 2007. [Bickel et al., 2008] Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer. Multi-task learning for hiv therapy screening. In ICML, 2008. [Bishop, 2006] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006. [Crammer et al., 2009] Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight vectors. In NIPS, 2009. [Donders et al., 2006] A Rogier T Donders, Geert JMG van der Heijden, Theo Stijnen, and Karel GM Moons. Review: a gentle introduction to imputation of missing values. JCE, 2006. [Fdez-Riverola et al., 2007] Florentino Fdez-Riverola, Eva Lorenzo Iglesias, Fernando D ıaz, Jos e Ramon M endez, and Juan M Corchado. Applying lazy learning algorithms to tackle concept drift in spam ﬁltering. ESWA, 2007. [Gama et al., 2014] Jo ao Gama, Indr e ˇZliobait e, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. CSUR, 2014. [Grandvalet and Bengio, 2004] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2004. [Hachiya et al., 2012] Hirotaka Hachiya, Masashi Sugiyama, and Naonori Ueda. Importance-weighted least-squares probabilistic classiﬁer for covariate shift adaptation with application to human activity recognition. Neurocomputing, 2012. [Haque et al., 2016] Ahsanul Haque, Latifur Khan, and Michael Baron. Sand: Semi-supervised adaptive novel class detection and classiﬁcation over data stream. In AAAI, 2016. [Huang et al., 2006] Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Sch olkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In NIPS, 2006. [Kanamori et al., 2009] Takafumi Kanamori, Shohei Hido, and Masahi Sugiyama. Efﬁcient direct density ratio estimation for non-stationarity adaptation and outlier detection. In NIPS, 2009. [Kingma et al., 2014] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semisupervised learning with deep generative models. In NIPS, 2014.

[Klinkenberg, 2004] Ralf Klinkenberg. Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, 2004. [Kumagai and Iwata, 2016] Atsutoshi Kumagai and Tomoharu Iwata. Learning future classiﬁers without additional data. In AAAI, 2016. [Kumagai and Iwata, 2017] Atsutoshi Kumagai and Tomoharu Iwata. Learning non-linear dynamics of decision boundaries for maintaining classiﬁcation performance. In AAAI, 2017. [Li et al., 2010] Yan Li, Hiroyuki Kambara, Yasuharu Koike, and Masashi Sugiyama. Application of covariate shift adaptation techniques in brain computer interfaces. Biomedical Engineering, IEEE Transactions on, 2010. [Liu and Ziebart, 2014] Anqi Liu and Brian Ziebart. Robust classiﬁcation under sample selection bias. In NIPS. 2014. [Ma et al., 2009] Justin Ma, Lawrence K Saul, Stefan Savage, and Geoffrey M Voelker. Identifying suspicious urls: an application of large-scale online learning. In ICML, 2009. [Nigam et al., 2000] Kamal Nigam, Andrew Kachites Mc Callum, Sebastian Thrun, and Tom Mitchell. Text classiﬁcation from labeled and unlabeled documents using em. Machine Learning, 2000. [Pan and Yang, 2010] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. TKDE, 2010. [Rubin, 2004] Donald B Rubin. Multiple imputation for nonresponse in surveys. John Wiley & Sons, 2004. [Shimodaira, 2000] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. JSPI, 2000. [Sugiyama et al., 2008] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, 2008. [Sugiyama et al., 2013] Masashi Sugiyama, Makoto Yamada, and Marthinus Christoffel du Plessis. Learning under nonstationarity: covariate shift and class-balance change. Wiley Interdisciplinary Reviews: Computational Statics, 2013. [Van der Maaten and Hinton, 2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 2008. [Wang et al., 2003] Haixun Wang, Wei Fan, Philip S Yu, and Jiawei Han. Mining concept-drifting data streams using ensemble classiﬁers. In KDD, 2003. [Zhang et al., 2013] Kai Zhang, Vincent Wenchen Zheng, Qiaojun Wang, James Tin-Yau Kwok, Qiang Yang, and Ivan Marsic. Covariate shift in hilbert space: a solution via sorrogate kernels. In ICML, 2013.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)