Learning with Previously Unseen Features

Yuan Shi
Computer Science Department, University of Southern California
yuanshi@usc.edu

Craig A. Knoblock
Information Sciences Institute, University of Southern California
knoblock@isi.edu

Abstract

We study the problem of improving a machine learning model by identifying and using features that are not in the training set. This is applicable to machine learning systems deployed in an open environment. For example, a prediction model built on a set of sensors may be improved when it has access to new and relevant sensors at test time. To effectively use new features, we propose a novel approach that learns a model over both the original and new features, with the goal of making the joint distribution of features and predicted labels similar to that in the training set. Our approach can naturally leverage labels associated with these new features when they are accessible. We present an efficient optimization algorithm for learning the model parameters and empirically evaluate the approach on several regression and classification tasks. Experimental results show that our approach achieves an average improvement of 11.2% over baselines.

1 Introduction

We consider a setting where a machine learning system can access features that are previously unseen in the labeled training set. This often happens when machine learning systems are deployed in an open environment. For example, consider a weather station equipped with a set of sensors measuring temperature, wind speed, pressure, etc. Suppose the station has a machine learning model that predicts humidity from the sensor readings. The prediction model is built on a training set containing historical sensor readings and the corresponding humidity. Now, the station plugs in a new sensor that measures dew point. Since dew point is strongly correlated with humidity, leveraging this new sensor could potentially improve the model. As a second example, consider a model that predicts a job applicant's quality. It is built on a training set where each sample contains features from the resume of a previous applicant and the quality level labeled by HR. Now, for future applicants, the model is allowed to access applicants' social media such as Facebook and Twitter. Exploring new features extracted from social media may improve the prediction quality, since social media data often reflect an applicant's personality and perspectives, which may not be revealed in resumes.

In the above examples, the new features are likely to provide complementary information over the original features in the labeled training set. Identifying and leveraging such features can potentially improve the underlying machine learning models. This is an important step towards developing intelligent systems that can interact with and learn from an open environment. The challenge is that there are no labeled samples associated with those additional features. Collecting new labels can be costly and often requires human input. It is therefore desirable to develop an approach that can automatically exploit previously unseen features.

This problem can be viewed in the framework of domain adaptation or transfer learning [Pan and Yang, 2010; Daumé III and Marcu, 2006; Quiñonero-Candela et al., 2009], which aims to adapt machine learning models trained on a source domain to a related but different target domain.
Our problem can be viewed as a special case of heterogeneous domain adaptation, where the feature space of the source domain differs from that of the target domain. Although it is a special case, our problem setting poses unique challenges for existing domain adaptation approaches. Previous work on heterogeneous domain adaptation mainly consists of two types of approaches. One type learns a transformation to map the features from one domain to the other [Dai et al., 2008; Socher et al., 2013; Zhou et al., 2014] by leveraging sample-level correspondences across domains (e.g., an image and its tag). This type of approach can be applied to our setting naturally, since each target-domain sample provides a correspondence between the original and new features. However, learning a good transformation may not be possible when the dependency between the original and new features is weak.¹ In our context, the new features are expected to contain complementary information not available in the original features, making such a transformation difficult to obtain. A second type of approach maps the source and target domains into a domain-invariant feature space in which they distribute similarly (in terms of the features in that space), and then learns a model in that space using labeled data [Kulis et al., 2011; Wang and Mahadevan, 2011; Argyriou et al., 2008; Duan et al., 2012; Shi et al., 2010; Harel and Mannor, 2010; Wei and Pal, 2011; Yeh et al., 2014]. These approaches are not applicable to our setting because the original features already provide a domain-invariant feature space, which would completely ignore the new features in the target domain.

¹Weak dependency implies that there is no strong correspondence between original and new features. In other words, knowing the original features is not very helpful when predicting the new features.

Our problem can also be viewed as a special case of semi-supervised learning [Zhu, 2005], with additional features associated with the unlabeled data. To the best of our knowledge, such a problem setting has not been explored before, because existing semi-supervised learning algorithms focus on using unlabeled data from the same feature space as the labeled data.

To address the challenge, we propose an approach named LUF (Learning with previously Unseen Features) that builds a machine learning model over both the original and new features. Intuitively, a desirable model should predict target-domain labels consistent with the labels in the source domain. We express this consistency in terms of the joint distributions over the original features and labels, and we enforce the two domains to have similar joint distributions. LUF minimizes k-nearest neighbor distances across domains in the joint space and can be solved by an efficient optimization algorithm. LUF can be easily extended to the case where there exist labeled samples with new features. We evaluate LUF extensively on benchmark regression and classification datasets and on a sensor adaptation application in the weather domain. In most of the evaluation cases, LUF outperforms other baseline methods, with an average improvement of 11.2%.

Contributions. To summarize, we study a new machine learning setting involving features that are not available in the labeled training set. We then propose a novel approach as well as an efficient optimization algorithm.
Our approach is applicable to both regression and classification tasks and can naturally leverage labels associated with these new features. Our empirical study highlights the efficacy of the proposed approach.

2 Approach

We are given $N$ labeled samples $\{(\mathbf{x}_s, y_s)\}_{s=1}^{N}$, where $\mathbf{x}_s \in \mathbb{R}^D$ is the input feature vector and $y_s$ is the corresponding label. Additionally, we are given $M$ unlabeled samples $\{(\mathbf{x}_t, \mathbf{z}_t)\}_{t=1}^{M}$, where $\mathbf{x}_t \in \mathbb{R}^D$ follows the same distribution as $\mathbf{x}_s$, and $\mathbf{z}_t \in \mathbb{R}^W$ are the additional features associated with $\mathbf{x}_t$. In the following, we refer to the set of labeled samples as the source domain and the set of unlabeled samples as the target domain. We are interested in learning a model for the target domain that maps $(\mathbf{x}, \mathbf{z})$ to $y$. We denote the model as $f_\theta(\mathbf{x}, \mathbf{z})$, where $\theta$ represents the model parameters, and the predicted label as $\hat{y} = f_\theta(\mathbf{x}, \mathbf{z})$.

For each source-domain sample $(\mathbf{x}_s, y_s)$, if we could estimate its $\hat{\mathbf{z}}_s$ reliably, we could simply train $f_\theta(\mathbf{x}, \mathbf{z})$ on the source domain. However, estimating $\mathbf{z}$ from $\mathbf{x}$ can be challenging when their dependency is weak, as observed in our empirical study. On the other hand, training on the target domain is very challenging as there are no labels available.

We tackle the challenge based on the following intuition: if our model predicts the target-domain labels $\{\hat{y}_t\}$ well, then $\{\hat{y}_t\}$ should be consistent with the training labels. Such consistency can be expressed through joint patterns between $\mathbf{x}$ and $y$, for instance, the joint distribution of $(\mathbf{x}, y)$. Ideally, our model $f_\theta$ would make the joint distribution similar across domains, which motivates us to seek $f_\theta$ such that $\{(\mathbf{x}_s, y_s)\}$ and $\{(\mathbf{x}_t, \hat{y}_t)\}$ are mixed as much as possible. When this happens, each source-domain sample $(\mathbf{x}_s, y_s)$ becomes close to its $k$-nearest neighbors in the target domain, and vice versa. Therefore, we propose the following objective, which minimizes the cross-domain $k$-nearest neighbor distances in the joint space of $(\mathbf{x}, y)$:

$$\min_\theta \; \sum_s \sum_{t \in \mathcal{N}_T^k(s)} \mathrm{dist}[(\mathbf{x}_s, y_s), (\mathbf{x}_t, \hat{y}_t)] + \sum_t \sum_{s \in \mathcal{N}_S^k(t)} \mathrm{dist}[(\mathbf{x}_t, \hat{y}_t), (\mathbf{x}_s, y_s)] + \lambda \|\theta\|_2^2 \tag{1}$$

where $\mathrm{dist}$ is the distance function defined on $(\mathbf{x}, y)$, $\mathcal{N}_T^k(s)$ denotes the set of indices corresponding to $(\mathbf{x}_s, y_s)$'s $k$-nearest neighbors in the target domain, and $\mathcal{N}_S^k(t)$ denotes the set of indices corresponding to $(\mathbf{x}_t, \hat{y}_t)$'s $k$-nearest neighbors in the source domain; nearest neighbors are determined based on $\mathrm{dist}$. $\|\theta\|_2^2$ is the regularization term on $\theta$, with $\lambda \geq 0$ as the regularization parameter. For simplicity, we set $\mathrm{dist}$ to be the following weighted distance:²

$$\mathrm{dist}[(\mathbf{x}_s, y_s), (\mathbf{x}_t, \hat{y}_t)] = \|\mathbf{x}_s - \mathbf{x}_t\|_2^2 + \gamma\, \ell(y_s, \hat{y}_t) \tag{2}$$

where $\ell(y_s, \hat{y}_t)$ measures the distance between $y_s$ and $\hat{y}_t$, and $\gamma > 0$ is a weight balancing the scales of $\mathbf{x}$ and $y$. Letting $v_{st}^2 = \|\mathbf{x}_s - \mathbf{x}_t\|_2^2$, we can rewrite (1) as

$$\min_\theta \; \sum_s \sum_{t \in \mathcal{N}_T^k(s)} \left[ v_{st}^2 + \gamma\, \ell(y_s, f_\theta(\mathbf{x}_t, \mathbf{z}_t)) \right] + \sum_t \sum_{s \in \mathcal{N}_S^k(t)} \left[ v_{ts}^2 + \gamma\, \ell(y_s, f_\theta(\mathbf{x}_t, \mathbf{z}_t)) \right] + \lambda \|\theta\|_2^2 \tag{3}$$

For regression tasks, we simply set $\ell(y_s, \hat{y}_t) = \|y_s - \hat{y}_t\|^2$. For classification tasks with $C$ classes, we use probabilistic classification models (e.g., logistic regression [Bishop, 2006]) that can predict class probabilities for a given sample. In this case, $\hat{y}_t$ is a $C$-dimensional vector representing the probability of each class, and $y_s$ is a $C$-dimensional binary vector. We set $\ell(y_s, \hat{y}_t) = 1 - \sum_{c=1}^{C} y_s(c)\,\hat{y}_t(c)$, so that a small $\ell$ corresponds to similar class assignments.

Note that $\mathcal{N}_T^k(s)$ and $\mathcal{N}_S^k(t)$ depend on $\theta$, hence (3) is non-smooth and non-convex in $\theta$. In the following, we present an efficient optimization algorithm for finding a local minimum of Eq. (3).
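To make the objective concrete, the following is a minimal NumPy sketch that evaluates Eq. (3) for the regression case, assuming a linear model $f_\theta(\mathbf{x}, \mathbf{z}) = \theta^\top [\mathbf{x}; \mathbf{z}; 1]$. The function and variable names are illustrative only, not part of the paper's released implementation.

```python
import numpy as np

def luf_objective(theta, Xs, ys, Xt, Zt, k=5, gamma=1.0, lam=0.1):
    """Evaluate objective (3) for regression, where
    l(y, y_hat) = (y - y_hat)^2 and f_theta is linear in [x; z; 1]."""
    # Target-domain predictions y_hat_t = f_theta(x_t, z_t).
    Ft = np.c_[Xt, Zt, np.ones(len(Xt))]
    yt_hat = Ft @ theta

    # Joint-space distances, Eq. (2): ||x_s - x_t||^2 + gamma * l(y_s, y_hat_t).
    v2 = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(axis=-1)   # v_{st}^2
    D = v2 + gamma * (ys[:, None] - yt_hat[None, :]) ** 2        # shape (N, M)

    # Cross-domain k-nearest-neighbor sums of Eq. (3).
    src_to_tgt = np.sort(D, axis=1)[:, :k].sum()  # each source point to its k NNs in target
    tgt_to_src = np.sort(D, axis=0)[:k, :].sum()  # each target point to its k NNs in source

    return src_to_tgt + tgt_to_src + lam * np.sum(theta ** 2)
```

Minimizing this objective directly is hard precisely because the neighbor sets change with $\theta$; the alternating algorithm below sidesteps this.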
²In our implementation, features are first normalized into similar scales.

2.1 Alternating Optimization

For ease of optimization, we introduce a set of auxiliary variables to decouple the dependency of $\mathcal{N}_T^k(s)$ and $\mathcal{N}_S^k(t)$ on $\theta$. Let $\mathcal{V}_T^k(s)$ index any (not necessarily the nearest) $k$ neighbors of $(\mathbf{x}_s, y_s)$ in the target domain, and $\mathcal{V}_S^k(t)$ index any $k$ neighbors of $(\mathbf{x}_t, \hat{y}_t)$ in the source domain. It is easy to see that

$$\sum_{t \in \mathcal{N}_T^k(s)} \left[ v_{st}^2 + \gamma\, \ell(y_s, f_\theta(\mathbf{x}_t, \mathbf{z}_t)) \right] = \min_{\mathcal{V}_T^k(s)} \sum_{t \in \mathcal{V}_T^k(s)} \left[ v_{st}^2 + \gamma\, \ell(y_s, f_\theta(\mathbf{x}_t, \mathbf{z}_t)) \right] \tag{4}$$

and the same relationship holds for $\mathcal{V}_S^k(t)$ and $\mathcal{N}_S^k(t)$. Thus (3) is equivalent to

$$\min_{\theta, \{\mathcal{V}_T^k(s)\}, \{\mathcal{V}_S^k(t)\}} \; \sum_s \sum_{t \in \mathcal{V}_T^k(s)} \left[ v_{st}^2 + \gamma\, \ell(y_s, f_\theta(\mathbf{x}_t, \mathbf{z}_t)) \right] + \sum_t \sum_{s \in \mathcal{V}_S^k(t)} \left[ v_{ts}^2 + \gamma\, \ell(y_s, f_\theta(\mathbf{x}_t, \mathbf{z}_t)) \right] + \lambda \|\theta\|_2^2 \tag{5}$$

Problem (5) can be efficiently optimized via an alternating procedure. When $\theta$ is fixed, we update $\{\mathcal{V}_T^k(s)\}$ and $\{\mathcal{V}_S^k(t)\}$ by nearest neighbor search. When $\{\mathcal{V}_T^k(s)\}$ and $\{\mathcal{V}_S^k(t)\}$ are fixed, we optimize $\theta$ by solving

$$\min_\theta \; \sum_s \sum_{t \in \mathcal{V}_T^k(s)} \gamma\, \ell(y_s, f_\theta(\mathbf{x}_t, \mathbf{z}_t)) + \sum_t \sum_{s \in \mathcal{V}_S^k(t)} \gamma\, \ell(y_s, f_\theta(\mathbf{x}_t, \mathbf{z}_t)) + \lambda \|\theta\|_2^2 \tag{6}$$

which is easier to optimize than (3) when $f_\theta$ is smooth in $\theta$. For regression tasks, if linear regression is used, $\theta$ can be solved for analytically. For classification tasks, if logistic regression is used, a global optimum of $\theta$ can be found using gradient descent techniques. The above alternating procedure decreases (5) at each alternating step and converges to a local minimum of (3). Empirically, the procedure converges quickly (usually within 50 iterations). The algorithm is outlined in Algorithm 1, with a code sketch following the complexity analysis below.

Algorithm 1: Optimization algorithm for LUF
Input: source-domain samples $\{(\mathbf{x}_s, y_s)\}_{s=1}^N$, target-domain samples $\{(\mathbf{x}_t, \mathbf{z}_t)\}_{t=1}^M$, neighbor size $k$, weight parameter $\gamma$, and regularization parameter $\lambda$.
  Initialize $\theta$ by solving Eq. (8)
  for iter = 1, 2, ..., T do
    for s = 1, 2, ..., N do: fix $\theta$, update $\mathcal{V}_T^k(s)$
    for t = 1, 2, ..., M do: fix $\theta$, update $\mathcal{V}_S^k(t)$
    Fix $\{\mathcal{V}_T^k(s)\}$ and $\{\mathcal{V}_S^k(t)\}$, optimize $\theta$ by solving Eq. (6)
  Output: a locally optimal solution $\theta^\ast$.

Initialization: The quality of the solution depends on how we initialize $\theta$. Intuitively, if we could obtain a good estimate $\hat{\mathbf{z}}_s$ for each source-domain sample $\mathbf{x}_s$, we could initialize $\theta$ by solving

$$\min_\theta \; \sum_s \ell(y_s, f_\theta(\mathbf{x}_s, \hat{\mathbf{z}}_s)) \tag{7}$$

However, estimating $\hat{\mathbf{z}}_s$ can be very challenging when the dependency between $\mathbf{x}$ and $\mathbf{z}$ is weak. Nevertheless, we can estimate a candidate set for $\hat{\mathbf{z}}_s$ from target-domain data as follows: for each $\mathbf{x}_s$, we find a set of its nearest neighbors in $\{\mathbf{x}_t\}$ and use the corresponding $\mathbf{z}_t$ to form a candidate set $\mathcal{Z}_s$. We then minimize the model error by optimizing both $\theta$ and $\{\hat{\mathbf{z}}_s\}$:

$$\min_\theta \; \sum_s \min_{\hat{\mathbf{z}}_s \in \mathcal{Z}_s} \ell(y_s, f_\theta(\mathbf{x}_s, \hat{\mathbf{z}}_s)) \tag{8}$$

where $\hat{\mathbf{z}}_s$ is allowed to be any element of $\mathcal{Z}_s$. Problem (8) essentially relaxes the dependency between $\mathbf{x}$ and $\mathbf{z}$ and uses the optimal $\theta$ for the relaxed setting as an initialization. By setting $\{\mathcal{Z}_s\}$ to different sizes, we can obtain different initial solutions for $\theta$.

Complexity analysis: In Algorithm 1, each iteration involves $N$ updates on $\mathcal{V}_T^k(s)$ and $M$ updates on $\mathcal{V}_S^k(t)$. Each update on $\mathcal{V}_T^k(s)$ takes $O(MD^2)$ and each update on $\mathcal{V}_S^k(t)$ takes $O(ND^2)$, where $D$ is the dimensionality of the original features. Therefore, the complexity of updating $\{\mathcal{V}_T^k(s)\}$ and $\{\mathcal{V}_S^k(t)\}$ at each iteration is $O(NMD^2)$. Additionally, each iteration involves learning a linear regression or logistic regression function, whose complexity is $O((D + W)^2(N + M))$, where $W$ is the dimensionality of the new features and we assume $N > D + W$. Further assuming $W = O(D)$ and $M = O(N)$, we have the overall complexity $O(D^2N^2)$.
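Below is a hedged sketch of Algorithm 1 for the regression case with a linear $f_\theta$, where the $\theta$-step of Eq. (6) reduces to ridge regression and therefore has a closed form. The initialization is a simplified stand-in for Eq. (8): it pairs each source sample with the $\mathbf{z}$ of its single nearest target neighbor rather than optimizing over a candidate set. All names are illustrative.

```python
import numpy as np

def luf_fit(Xs, ys, Xt, Zt, k=5, gamma=1.0, lam=0.1, T=50):
    """Sketch of Algorithm 1 (regression): alternate between
    (i) updating cross-domain neighbor sets V_T^k(s), V_S^k(t) under
    the current predictions, and (ii) refitting theta via Eq. (6),
    which here is ridge regression over the selected pairs."""
    N, M = len(Xs), len(Xt)
    Ft = np.c_[Xt, Zt, np.ones(M)]                       # target design matrix
    v2 = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)

    # Simplified initialization (stand-in for Eq. (8)): borrow z from
    # each source sample's nearest target neighbor in x-space.
    Fs0 = np.c_[Xs, Zt[v2.argmin(axis=1)], np.ones(N)]
    theta = np.linalg.solve(Fs0.T @ Fs0 + lam * np.eye(Fs0.shape[1]), Fs0.T @ ys)

    for _ in range(T):
        # Step 1: fix theta, update neighbor sets by nearest neighbor search.
        yt_hat = Ft @ theta
        D = v2 + gamma * (ys[:, None] - yt_hat[None, :]) ** 2
        VT = np.argsort(D, axis=1)[:, :k]                # k target NNs per source s
        VS = np.argsort(D, axis=0)[:k, :]                # k source NNs per target t

        # Step 2: fix neighbor sets, solve Eq. (6) in closed form by
        # stacking all selected (source label, target feature) pairs.
        rows = [Ft[VT[s]] for s in range(N)]
        labels = [np.full(k, ys[s]) for s in range(N)]
        rows += [np.tile(Ft[t], (k, 1)) for t in range(M)]
        labels += [ys[VS[:, t]] for t in range(M)]
        A, b = np.vstack(rows), np.concatenate(labels)
        theta = np.linalg.solve(gamma * (A.T @ A) + lam * np.eye(A.shape[1]),
                                gamma * (A.T @ b))
    return theta
```

In practice one would also add the early-stopping checks described below and monitor the decrease of objective (5) to detect convergence.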
Hyper-parameter tuning and overfitting prevention: When there are no labels in the target domain, tuning hyper-parameters is often very difficult. We tune the neighborhood size $k$ by reducing the error of a $k$-NN classifier or regressor on the source domain. To tune the weight $\gamma$ and the regularization parameter $\lambda$, we apply a leave-one-out cross validation strategy on the source domain to simulate our problem setting. Specifically, we consider each feature in the source domain as $\mathbf{z}$ and the rest of the features as $\mathbf{x}$. We then split the source-domain samples into two disjoint sets: one set has only $\mathbf{x}$, and the other set has both $\mathbf{x}$ and $\mathbf{z}$. We treat the first set as a synthesized source domain and the second set as a synthesized target domain, and apply the proposed approach to pick the optimal $\gamma$ and $\lambda$ (see the sketch at the end of this section).

To prevent overfitting, we adopt an early stopping strategy: we train a model on $\{(\mathbf{x}_t, \hat{y}_t)\}$ and apply it to the source-domain data. If the prediction error on the source domain is larger than a certain threshold, we stop the learning process. We also terminate the optimization when the objective function decreases very slowly, which not only saves computational time but also reduces overfitting.

Leveraging labels in the target domain: LUF can naturally leverage labels in the target domain when they are available. Suppose a subset of target-domain samples is labeled, and let $L$ denote their sample indices. We can directly enhance LUF by adding a supervised term to the objective function:

$$\min_\theta \; \sum_s \sum_{t \in \mathcal{N}_T^k(s)} \left[ v_{st}^2 + \gamma\, \ell(y_s, f_\theta(\mathbf{x}_t, \mathbf{z}_t)) \right] + \sum_t \sum_{s \in \mathcal{N}_S^k(t)} \left[ v_{ts}^2 + \gamma\, \ell(y_s, f_\theta(\mathbf{x}_t, \mathbf{z}_t)) \right] + \lambda \|\theta\|_2^2 + \mu \sum_{t \in L} \ell(y_t, f_\theta(\mathbf{x}_t, \mathbf{z}_t)) \tag{9}$$

where $\mu \geq 0$ is a new weight parameter that can be tuned by cross validation on the target domain.
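As a concrete illustration of the synthesized-domain tuning strategy described above, here is a sketch that rotates each source feature into the role of $\mathbf{z}$ and scores candidate $(\gamma, \lambda)$ pairs on held-out source labels. It assumes the `luf_fit` sketch from Section 2.1, uses squared error for scoring, and uses a fixed half split for brevity (a randomized split would be preferable in practice).

```python
import numpy as np

def tune_gamma_lambda(Xs, ys, grid, k=5):
    """Pick (gamma, lambda) without target labels: each source feature
    in turn plays the 'unseen' z, the remaining features play x, and
    the source samples are split into a synthesized source set (x only)
    and a synthesized target set (x and z, labels held out for scoring)."""
    best, best_err = None, np.inf
    half = len(Xs) // 2
    for gamma, lam in grid:
        errs = []
        for j in range(Xs.shape[1]):                 # feature j acts as z
            x, z = np.delete(Xs, j, axis=1), Xs[:, j:j + 1]
            theta = luf_fit(x[:half], ys[:half],     # synthesized source
                            x[half:], z[half:],      # synthesized target
                            k=k, gamma=gamma, lam=lam)
            F = np.c_[x[half:], z[half:], np.ones(len(Xs) - half)]
            errs.append(np.mean((F @ theta - ys[half:]) ** 2))
        if np.mean(errs) < best_err:
            best, best_err = (gamma, lam), np.mean(errs)
    return best

# Example grid: [(g, l) for g in (0.1, 1, 10) for l in (0.01, 0.1, 1)]
```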
3 Experiments

We evaluate LUF on several regression and classification datasets and on an application to sensor adaptation for weather stations.³ We develop the following methods for comparison:

- R: kernel regression [Bishop, 2006] trained on the source domain. A polynomial kernel⁴ $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j + c)^d$ is used, where $c$ and $d$ are kernel parameters.
- R-ZKR, R-ZNN: domain adaptation methods that explore the correspondences between $\mathbf{x}$ and $\mathbf{z}$ in the target domain. They first estimate $\hat{\mathbf{z}}$ for each source-domain sample and then train a kernel regression model over $\{(\mathbf{x}_s, \hat{\mathbf{z}}_s)\}$ on the source domain. Two strategies are explored for estimating $\hat{\mathbf{z}}$. ZKR: train a kernel regression model on the target domain to map $\mathbf{x}_t$ to $\mathbf{z}_t$, and use that model to predict $\hat{\mathbf{z}}_s$ for each source-domain sample $\mathbf{x}_s$. ZNN: the same as ZKR except that a $k$-NN regression model is used.
- R-Z*: kernel regression trained directly on the target domain, using the target-domain labels ("cheating"). This method, though unrealistic in practice, provides an upper bound on the best possible performance. We apply ten-fold cross validation on the target domain and report the average error.
- C: logistic regression [Bishop, 2006] trained on the source domain. Since the dimensionality of the classification datasets is relatively high, we directly use the input features and do not apply the polynomial kernel.
- C-ZKR, C-ZNN, C-Z*: the same as R-ZKR, R-ZNN, and R-Z*, respectively, except that kernel regression is replaced by logistic regression.
- LUF: the proposed approach for learning with previously unseen features. For regression tasks, it trains a linear model based on the explicit feature mappings derived from the polynomial kernel. For classification tasks, it trains a logistic regression model on the input features.

³Our algorithms and datasets can be accessed from https://github.com/yuanshi/UnseenFeatures
⁴We use the polynomial kernel over other nonlinear kernels because it has an explicit form for the nonlinear features. For most datasets, we find the polynomial kernel performs comparably to a Gaussian RBF kernel. Our focus is more on leveraging new features than on choosing the best kernels.

The hyper-parameters of the above methods, including $c$ and $d$ in the polynomial kernel, $k$ in $k$-NN regression, and the regularization parameter $\lambda$, are tuned on the source domain. For R-Z* and C-Z*, the regularization parameters are tuned on the target domain.

3.1 Results on Regression Datasets

We experiment with four regression datasets:

- Abalone,⁵ for predicting the age of abalone, contains 4,177 samples with 8 features.
- Bank,⁶ which predicts the fraction of bank customers who are turned away due to queuing, contains 8,192 samples with 8 features.
- CPU,⁷ for CPU running time prediction, contains 8,192 samples with 12 features.
- House,⁸ for housing price prediction, contains 20,640 samples with 9 features.

⁵https://archive.ics.uci.edu/ml/datasets/Abalone
⁶http://www.cs.toronto.edu/~delve/data/bank/desc.html
⁷http://www.cs.toronto.edu/~delve/data/comp-activ/desc.html
⁸http://lib.stat.cmu.edu/datasets/

In order to select a subset of features to be "unseen", we look at features with high predictive power, which enables a machine learning model to improve prediction performance by leveraging these features.⁹ Specifically, we perform the following check for each feature: if removing it increases the prediction error by more than 10 percent, it is selected as an unseen feature. We then sort the unseen features by their impact on the prediction error in descending order (a sketch of this policy appears below). Table 1 shows the results when the top set of unseen features is explored. For each dataset, we conduct experiments in 10 random trials. In each trial, we randomly split the dataset into the source and target domains, each with half of the samples. We report the average root mean square error (RMSE) and standard error on the target domain.¹⁰ As a preprocessing step, we normalize each feature to [0, 1].

⁹Our experimental policy selects unseen features useful for prediction, which is designed to evaluate LUF's efficacy. We also conduct experiments on unseen features that are uninformative or noisy. We observe that LUF is quite robust to such features (performance drops less than 2%). LUF often stops using these features because either there is only a minor decrease in the objective function or early stopping is triggered.
¹⁰All target-domain data are used in the learning process. We also tested our methods on held-out evaluation data from the target domain and observed consistent performance.

Table 1 summarizes the prediction errors of the different methods, where the improvement (%) is computed relative to the best-performing baseline. In most cases, LUF outperforms the other baselines, with an average improvement of 12.1%.
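For completeness, here is a sketch of the unseen-feature selection policy described above (the 10% error-increase check followed by sorting by impact). The helper `train_and_eval(X, y) -> error`, which trains a model and returns its prediction error, is an assumed placeholder.

```python
import numpy as np

def select_unseen_features(X, y, train_and_eval):
    """Experimental policy sketch: a feature is marked 'unseen' if
    removing it raises the prediction error by more than 10%; selected
    features are returned sorted by impact, most impactful first."""
    base_err = train_and_eval(X, y)                 # error with all features
    impact = []
    for j in range(X.shape[1]):
        err_j = train_and_eval(np.delete(X, j, axis=1), y)
        if err_j > 1.1 * base_err:                  # > 10% error increase
            impact.append((j, err_j - base_err))
    impact.sort(key=lambda p: -p[1])
    return [j for j, _ in impact]
```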
Table 1: Prediction error (RMSE) on regression datasets. For each dataset, R-Z* is reported once (on the middle row for datasets with multiple rows).

| Dataset | Unseen feat. ID | R | R-ZKR | R-ZNN | R-Z* | LUF | Improv. (%) |
|---|---|---|---|---|---|---|---|
| Abalone | 1 | 2.42 ± 0.08 | 2.33 ± 0.076 | 2.31 ± 0.064 | 2.28 ± 0.08 | 2.28 ± 0.01 | 1.3 |
| Bank | 1 | 0.12 ± 0.00 | 0.11 ± 0.01 | 0.11 ± 0.00 | | 0.12 ± 0.00 | -3.7 |
| Bank | 1,2 | 0.15 ± 0.00 | 0.16 ± 0.01 | 0.15 ± 0.01 | 0.034 ± 0.00 | 0.13 ± 0.00 | 17.8 |
| Bank | 1,2,3 | 0.15 ± 0.00 | 0.16 ± 0.00 | 0.16 ± 0.01 | | 0.14 ± 0.00 | 4.6 |
| CPU | 1 | 8.34 ± 0.20 | 9.17 ± 0.21 | 6.81 ± 0.13 | | 5.35 ± 0.60 | 21.4 |
| CPU | 1,2 | 8.37 ± 0.19 | 9.15 ± 0.26 | 6.23 ± 0.18 | 3.63 ± 0.089 | 5.79 ± 0.56 | 7.1 |
| CPU | 1,2,3 | 8.59 ± 0.18 | 8.74 ± 0.24 | 5.75 ± 0.44 | | 5.39 ± 0.48 | 6.3 |
| House | 1 | 10.17 ± 1.12 | 10.14 ± 1.11 | 9.55 ± 1.18 | | 6.82 ± 0.055 | 28.6 |
| House | 1,2 | 10.77 ± 1.29 | 10.49 ± 1.03 | 8.52 ± 0.86 | 6.66 ± 0.25 | 6.90 ± 0.054 | 19.1 |
| House | 1,2,3 | 12.60 ± 1.48 | 12.22 ± 1.35 | 9.60 ± 1.15 | | 7.83 ± 0.088 | 18.4 |

Note: the results for House are on the scale of 10⁵.

This demonstrates that LUF is able to exploit useful information in the additional features. In a few cases, LUF performs slightly worse than the best baseline, but the gap is much smaller than that in the successful cases. This suggests that although LUF may overfit due to its unsupervised nature, it is quite robust.

It is interesting to study the improvement of LUF over the baselines as the number of unseen features increases. With more unseen features, the baselines tend to become worse while LUF has more useful features to leverage, which can lead to greater improvement, e.g., on Bank [1,2]. On the other hand, more unseen features make the remaining ones less useful. LUF can suffer in this case since the sample distances become less meaningful, e.g., on Bank [1,2,3] and CPU [1,2,3]. In practice, this may not be a severe problem because the features in the source domain are often reasonably good to begin with.

3.2 Results on Classification Datasets

We experiment with three classification datasets:

- USPS, which recognizes handwritten digits in images, contains 9,298 samples from 10 classes [Hull, 1994].
- Books, which performs sentiment analysis on book reviews from Amazon, contains 4,000 samples from 2 classes [Blitzer et al., 2006].
- Webcam, which recognizes objects in low-resolution images taken by web cameras, contains 795 samples from 10 classes [Kulis et al., 2011].

For Books and Webcam, we split the data into the source and target domains, each with half of the samples. For USPS, the source and target domains use 1,500 samples each. For all datasets, we scale each feature to [0, 1] and then use principal component analysis (PCA) to reduce the dimensionality to 100, which reduces computational cost and feature noise. Similar to the regression tasks, we check each feature as follows: if removing it increases the classification error by more than 10 percent, it is selected as an unseen feature.

Table 2 shows the results when the top 1 to 3 unseen features are explored. For each dataset, 10 random splits are used and average classification errors are reported. C-ZKR and C-ZNN perform better than C in all cases, which suggests that $\hat{\mathbf{z}}$ can be estimated fairly well for these datasets. LUF outperforms C-ZKR and C-ZNN in most cases, with an average improvement of 3.6%. When the number of unseen features increases, LUF shows greater improvement over the baselines, as estimating $\mathbf{z}$ becomes harder.

3.3 Sensor Adaptation for Weather Stations

We apply the proposed approach to an application for sensor adaptation. In particular, we investigate how sensors at weather stations can be adapted when sensor failure happens. In a real-world environment, sensor failure often occurs at weather stations [Dereszynski and Dietterich, 2011] and can cause problems for system modules relying on the failed sensor.
We aim to develop a robust system that automatically reconstructs the missing signal from the working ones. Missing signal reconstruction can be cast as a regression task, and a regression model can be built from historical sensor readings. The reconstruction quality depends heavily on the correlations between the missing signal and the remaining ones. For example, reconstructing temperature from dew point and humidity can be fairly accurate, but reconstructing wind speed from dew point and humidity is very hard. Our system handles this challenge by allowing a weather station to access the sensors of a nearby station when a sensor failure is detected.

We conduct experiments with data from Weather Underground,¹¹ which contains sensor data from a large number of personal weather stations worldwide. Each station consists of 5-10 sensors and produces a sample (combined readings from all sensors) at a fixed time interval (e.g., every 10 minutes). We examine three stations in San Francisco, San Jose, and New York, respectively, as well as their nearby stations. For a given station, we use 3,000 samples randomly drawn from Jan. 2015 to Aug. 2015 as the source domain and 3,000 random samples from Jan. 2016 to Aug. 2016 as the target domain.

¹¹http://www.wunderground.com

Table 3 summarizes the results. The reported cases correspond to signals that are difficult to reconstruct using a weather station's own sensors.¹²

¹²Note that precipitation data are not available for SF-A and SF-B, and wind speed and wind gust data are not available for NY-A and NY-B.

Table 2: Classification error rate (%) on classification datasets. For each dataset, C-Z* is reported once (on the middle row).

| Dataset | Unseen feat. ID | C | C-ZKR | C-ZNN | C-Z* | LUF | Improv. (%) |
|---|---|---|---|---|---|---|---|
| USPS | 1 | 10.9 ± 0.08 | 9.2 ± 0.2 | 8.1 ± 0.1 | | 8.0 ± 0.2 | 1.1 |
| USPS | 1,2 | 14.9 ± 0.1 | 13.4 ± 0.2 | 8.5 ± 0.2 | 7.8 ± 0.08 | 8.2 ± 0.2 | 3.5 |
| USPS | 1,2,3 | 19.3 ± 0.1 | 16.3 ± 0.2 | 9.3 ± 0.2 | | 8.7 ± 0.2 | 7.1 |
| Books | 1 | 24.9 ± 0.4 | 24.2 ± 0.3 | 23.4 ± 0.3 | | 23.8 ± 0.3 | -1.7 |
| Books | 1,2 | 26.5 ± 0.3 | 25.6 ± 0.4 | 24.9 ± 0.3 | 22.6 ± 0.2 | 24.1 ± 0.3 | 3.6 |
| Books | 1,2,3 | 28.7 ± 0.3 | 27.8 ± 0.3 | 25.9 ± 0.3 | | 24.6 ± 0.2 | 5.0 |
| Webcam | 1 | 36.7 ± 0.2 | 35.4 ± 0.3 | 35.4 ± 0.3 | | 34.8 ± 0.4 | 1.7 |
| Webcam | 1,2 | 38.5 ± 0.3 | 37.5 ± 0.3 | 37.4 ± 0.3 | 33.2 ± 0.2 | 35.3 ± 0.3 | 6.3 |
| Webcam | 1,2,3 | 40.5 ± 0.4 | 39.1 ± 0.4 | 38.3 ± 0.3 | | 36.2 ± 0.3 | 5.4 |

Table 3: Prediction error (RMSE) on weather data

| Stations | Missing signal | R | R-ZKR | R-ZNN | R-Z* | LUF | Imp. (%) |
|---|---|---|---|---|---|---|---|
| SF-A : SF-B | wind speed | 5.80 ± 0.024 | 5.94 ± 0.051 | 5.76 ± 0.030 | 5.13 ± 0.019 | 5.93 ± 0.032 | -2.9 |
| SF-A : SF-B | wind gust | 10.52 ± 0.059 | 10.76 ± 0.20 | 10.45 ± 0.18 | 7.94 ± 0.045 | 9.70 ± 0.068 | 7.2 |
| SF-A : SF-B | pressure | 4.53 ± 0.23 | 4.93 ± 0.25 | 4.60 ± 0.35 | 0.32 ± 0.029 | 1.52 ± 0.25 | 66.4 |
| SJ-A : SJ-B | wind speed | 3.76 ± 0.082 | 3.94 ± 0.15 | 3.78 ± 0.066 | 1.11 ± 0.018 | 3.74 ± 0.053 | 0.51 |
| SJ-A : SJ-B | wind gust | 4.41 ± 0.083 | 4.38 ± 0.092 | 4.37 ± 0.10 | 1.10 ± 0.013 | 4.16 ± 0.045 | 4.8 |
| SJ-A : SJ-B | pressure | 4.01 ± 0.021 | 4.03 ± 0.10 | 3.86 ± 0.079 | 0.15 ± 0.012 | 1.95 ± 0.068 | 49.5 |
| SJ-A : SJ-B | precipitation | 0.57 ± 0.034 | 0.56 ± 0.082 | 0.61 ± 0.12 | 0.07 ± 0.014 | 0.46 ± 0.062 | 17.9 |
| NY-A : NY-B | pressure | 11.40 ± 0.14 | 11.43 ± 1.17 | 10.34 ± 0.019 | 0.43 ± 0.022 | 9.74 ± 0.21 | 5.8 |
| NY-A : NY-B | precipitation | 3.17 ± 0.092 | 3.96 ± 0.14 | 4.19 ± 0.26 | 0.68 ± 0.032 | 2.82 ± 0.15 | 11.5 |

Note: the results are reported in the following units: mph (wind speed), mph (wind gust), in (pressure), in (precipitation). The listed weather stations correspond to the following station IDs in Weather Underground: SF-A (KCASANFR142), SF-B (KCASANFR114), SJ-A (KCASANJO121), SJ-B (KCASANJO139), NY-A (KNYNEWYO139), NY-B (KNYNEWYO132).

LUF achieves positive improvement in 8 out of 9 cases, with the most significant improvement (66.4%) on reconstructing pressure for SF-A.
Note that the amount of improvement is consistent with the gap between R-Z* and the other baselines. When the gap is large, LUF leverages better features and outperforms the other baselines by a large margin. When the gap is small (wind speed at SF-A), LUF performs worse than R due to overfitting, but with a fairly small performance drop.

Figure 1 visualizes the joint distributions over original features and predicted labels, where we choose wind speed as a representative feature and pressure as the label, on station SF-A. As can be observed, R generates a significantly different joint distribution compared to the actual one, while LUF produces a much closer distribution. R-Z* performs even better but relies on the labels from the target domain.

[Figure 1: Visualization of wind speed and predicted pressure on weather station SF-A. Panel (a) shows the ground truth; the x-axis represents wind speed (feature) and the y-axis represents pressure (label) given by the different approaches. Ground truth corresponds to the actual pressure. Values are in normalized scales.]

4 Related Work

Our learning setting can be viewed as a special case of heterogeneous domain adaptation [Pan and Yang, 2010], which adapts a learning model across different feature spaces. The work of [Zhao and Hoi, 2010] and [Hou and Zhou, 2016] also considers new features in the target domain; however, their approaches require labels from the target domain, which does not work for our setting. A similar nearest-neighbor-based objective function is used in [Kulis et al., 2011]; however, their neighbors are fixed while ours are dynamically updated. More importantly, our approach computes distances over both features and labels.

Another way to address our setting is to treat it as a missing data problem: the additional features available in the target domain are completely missing in the source domain. The baselines in our empirical study are also approaches for missing value imputation [Grzymala-Busse and Hu, 2000; Lakshminarayan et al., 1996; Batista et al., 2002]. However, recovering these features in the source domain is very challenging in nature.

Our setting is related to zero-shot learning [Larochelle et al., 2008; Farhadi et al., 2009], which explores unseen classes at test time, as well as to privileged-information-based learning [Vapnik and Vashist, 2009], which leverages additional information only available in training. Although related, our setting is very different from theirs.

5 Conclusion

We presented a novel machine learning approach that leverages features previously unseen in the training set. The approach is applicable to both classification and regression tasks. Supported by our empirical results, the approach can be used to improve a learning model when new features are accessible. Our future work includes theoretical analysis and more real-world applications. We also plan to develop algorithms that can automatically determine when to explore new features and which features to select from a large pool of features in an open environment.

Acknowledgements

This material is based upon work supported by the United States Air Force and the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-16-C-0045.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force and DARPA.

References

[Argyriou et al., 2008] Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 71–85. Springer, 2008.

[Batista et al., 2002] Gustavo E. A. P. A. Batista, Maria Carolina Monard, et al. A study of k-nearest neighbour as an imputation method. HIS, 87(251-260):48, 2002.

[Bishop, 2006] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[Blitzer et al., 2006] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128. Association for Computational Linguistics, 2006.

[Dai et al., 2008] Wenyuan Dai, Yuqiang Chen, Gui-Rong Xue, Qiang Yang, and Yong Yu. Translated learning: Transfer learning across different feature spaces. In Advances in Neural Information Processing Systems, pages 353–360, 2008.

[Daumé III and Marcu, 2006] Hal Daumé III and Daniel Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006.

[Dereszynski and Dietterich, 2011] Ethan W. Dereszynski and Thomas G. Dietterich. Spatiotemporal models for data-anomaly detection in dynamic environmental monitoring campaigns. ACM Transactions on Sensor Networks (TOSN), 8(1):3, 2011.

[Duan et al., 2012] Lixin Duan, Dong Xu, and Ivor W. Tsang. Learning with augmented features for heterogeneous domain adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 711–718, 2012.

[Farhadi et al., 2009] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 1778–1785. IEEE, 2009.

[Grzymala-Busse and Hu, 2000] Jerzy W. Grzymala-Busse and Ming Hu. A comparison of several approaches to missing attribute values in data mining. In International Conference on Rough Sets and Current Trends in Computing, pages 378–385. Springer, 2000.

[Harel and Mannor, 2010] Maayan Harel and Shie Mannor. Learning from multiple outlooks. 2010.

[Hou and Zhou, 2016] Chenping Hou and Zhi-Hua Zhou. One-pass learning with incremental and decremental features. arXiv preprint arXiv:1605.09082, 2016.

[Hull, 1994] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.

[Kulis et al., 2011] Brian Kulis, Kate Saenko, and Trevor Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1785–1792. IEEE, 2011.

[Lakshminarayan et al., 1996] Kamakshi Lakshminarayan, Steven A. Harp, Robert P. Goldman, Tariq Samad, et al. Imputation of missing data using machine learning techniques. In KDD, pages 140–145, 1996.

[Larochelle et al., 2008] Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. Zero-data learning of new tasks. In AAAI, volume 1, page 3, 2008.

[Pan and Yang, 2010] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[Quiñonero-Candela et al., 2009] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.

[Shi et al., 2010] Xiaoxiao Shi, Qi Liu, Wei Fan, Philip S. Yu, and Ruixin Zhu. Transfer learning on heterogenous feature spaces via spectral transformation. In 2010 IEEE International Conference on Data Mining, pages 1049–1054. IEEE, 2010.

[Socher et al., 2013] Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943, 2013.

[Vapnik and Vashist, 2009] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5):544–557, 2009.

[Wang and Mahadevan, 2011] Chang Wang and Sridhar Mahadevan. Heterogeneous domain adaptation using manifold alignment. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1541, 2011.

[Wei and Pal, 2011] Bin Wei and Christopher J. Pal. Heterogeneous transfer learning with RBMs. In AAAI, 2011.

[Yeh et al., 2014] Yi-Ren Yeh, Chun-Hao Huang, and Yu-Chiang Frank Wang. Heterogeneous domain adaptation and classification by exploiting the correlation subspace. IEEE Transactions on Image Processing, 23(5):2009–2018, 2014.

[Zhao and Hoi, 2010] Peilin Zhao and Steven C. Hoi. OTL: A framework of online transfer learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1231–1238, 2010.

[Zhou et al., 2014] Joey Tianyi Zhou, Sinno Jialin Pan, Ivor W. Tsang, and Yan Yan. Hybrid heterogeneous transfer learning through deep learning. In AAAI, pages 2213–2220, 2014.

[Zhu, 2005] Xiaojin Zhu. Semi-supervised learning literature survey. 2005.