# Unsupervised Feature Learning from Time Series

Qin Zhang, Jia Wu, Hong Yang, Yingjie Tian, Chengqi Zhang

Quantum Computation & Intelligent Systems Centre, University of Technology Sydney, Australia
Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing, China
Key Lab of Big Data Mining & Knowledge Management, Chinese Academy of Sciences, Beijing, China
MathWorks, Beijing, China

{qin.zhang@student., jia.wu@, chengqi.zhang@}uts.edu.au, hong.yang@mathworks.cn, tyj@ucas.ac.cn

Abstract

In this paper we study the problem of learning discriminative features (segments) of time series, often referred to as shapelets [Ye and Keogh, 2009], from unlabeled time series data. Discovering shapelets for time series classification has been widely studied, and many search-based algorithms have been proposed to efficiently scan and select segments from a pool of candidates. However, such search-based algorithms may incur high time costs when the segment candidate pool is large. Alternatively, a recent work [Grabocka et al., 2014] uses regression learning to directly learn, instead of searching for, shapelets from time series. Motivated by the above observations, we propose a new Unsupervised Shapelet Learning Model (USLM) to efficiently learn shapelets from unlabeled time series data. The corresponding learning function integrates the strengths of pseudo-class labels, spectral analysis, a shapelet regularization term and regularized least-squares to auto-learn shapelets, pseudo-class labels and classification boundaries simultaneously. A coordinate descent algorithm is used to iteratively solve the learning function. Experiments show that USLM outperforms search-based algorithms on real-world time series data.

1 Introduction

Time series classification has wide applications in finance [Ruiz et al., 2012], medicine [Hirano and Tsumoto, 2006] and trajectory analysis [Cai and Ng, 2004]. The main challenge of time series classification is to find discriminative features that can best predict class labels. To address this challenge, a line of works has been proposed to extract discriminative features, often referred to as shapelets, from time series. Shapelets are maximally discriminative features of time series which enjoy the merit of high prediction accuracy and are easy to interpret [Ye and Keogh, 2009]. Therefore, discovering shapelets has become an important branch of time series analysis.

The seminal work on shapelet discovery [Ye and Keogh, 2009] resorts to a full scan of all possible time series segments, where the segments are ranked according to a predefined distance metric and the segments which best predict the class labels are selected as shapelets. Based on the seminal work, a line of speed-up algorithms [Mueen et al., 2011] [Rakthanmanon and Keogh, 2013] [Chang et al., 2012] have been proposed to improve the performance. All these methods can be categorized as search-based algorithms which scan a pool of candidate segments.

Figure 1: The two blue thin lines represent the time series data. The upper one is rectangular signals with noise and the lower one is sinusoidal signals with noise. The curves marked in bold red are the learnt features (shapelets). We can observe that the learnt shapelets may differ from all candidate segments and are thus robust to noise.
For example, the Synthetic Control dataset [Chen et al., 2015], which contains 600 time series examples of length 60, yields $1.098 \times 10^{6}$ candidate segments over all lengths. On the other hand, a recent work [Grabocka et al., 2014] proposes a new time series shapelet learning approach. Instead of searching for shapelets in a candidate pool, it uses regression learning to learn shapelets directly from time series. This way, shapelets are detached from candidate segments, and the learnt shapelets may differ from all candidate segments. More importantly, shapelet learning is fast to compute, scalable to large datasets, and robust to noise, as shown in Fig. 1.

We present a new Unsupervised Shapelet Learning Model (USLM for short) that can auto-learn shapelets from unlabeled time series data. We first introduce pseudo-class labels to transform unsupervised learning into supervised learning. Then, we use the popular regularized least-squares and spectral analysis approaches to learn both shapelets and classification boundaries. A new regularization term is also added to avoid selecting similar shapelets. A coordinate descent algorithm is used to iteratively solve for the pseudo-class labels, classification boundary and shapelets. Compared to the search-based algorithms on unlabeled time series data [Zakaria et al., 2012], our method provides a new tool that learns shapelets from unlabeled time series data. The contributions of the work are summarized as follows:

- We present a new Unsupervised Shapelet Learning Model (USLM) to learn shapelets from unlabeled time series data. USLM combines pseudo-class labels, spectral analysis, shapelet regularization and regularized least-squares for learning.
- We empirically validate the performance of USLM on both synthetic and real-world data. The results are promising compared to the state-of-the-art unsupervised shapelet selection models.

The remainder of the paper is organized as follows: Section 2 surveys related work. Section 3 gives the preliminaries. Section 4 introduces the proposed unsupervised shapelet learning model USLM. Section 5 introduces the learning algorithm with analysis. Section 6 conducts experiments. We conclude the paper in Section 7.

2 Related Work

Shapelets [Ye and Keogh, 2009] are short time series segments that can best predict class labels. The basic idea of shapelet discovery is to consider all segments of the training data and assess them with a scoring function that estimates how predictive they are of the given class labels [Wistuba et al., 2015]. The seminal work [Ye and Keogh, 2009] builds a decision tree classifier by recursively searching for informative shapelets measured by information gain. Besides information gain, several other measures such as F-Stat, Kruskal-Wallis and Mood's median have been used for shapelet selection [Hills et al., 2014] [Lines et al., 2012]. Since time series data usually have a large number of candidate segments, brute-force shapelet selection is computationally infeasible. Therefore, a series of speed-up techniques have been proposed. On the one hand, there are smart implementations using early abandon of distance computations and entropy pruning of the information gain heuristic [Ye and Keogh, 2009].
On the other hand, many speed-ups rely on the reuse of computations and pruning of the search space [Mueen et al., 2011], as well as pruning candidates by searching for possibly interesting candidates on the SAX representation [Rakthanmanon and Keogh, 2013] or using infrequent shapelets [He et al., 2012]. Shapelets have been applied in a series of real-world applications.

**Shapelet learning.** Instead of searching for shapelets exhaustively, a recent work [Grabocka et al., 2014] proposes to learn optimal shapelets and reports statistically significant improvements in accuracy over other shapelet-based classifiers. Instead of restricting the pool of possible candidates to segments found in the training data and simply searching them, it treats shapelets as features that can be learnt through regression learning. This type of learning method does not consider a limited set of candidates and can obtain arbitrary shapelets.

**Unsupervised feature selection.** Many unsupervised feature selection algorithms have been proposed to select informative features from unlabeled data. A commonly used criterion in unsupervised feature learning is to select the features that best preserve the data similarity or manifold structure constructed from the whole feature space [Zhao and Liu, 2007] [Cai et al., 2010], but such methods fail to incorporate the discriminative information implied within the data and cannot be directly applied to our shapelet learning problem. Earlier unsupervised feature selection algorithms evaluate the importance of each feature individually and select features one by one [He et al., 2005] [Zhao and Liu, 2007], with the limitation that correlation among features is neglected [Cai et al., 2010] [Zhao et al., 2010] [Zhang et al., 2015]. State-of-the-art unsupervised feature selection algorithms perform feature selection by simultaneously exploiting discriminative information and feature correlation. Unsupervised Discriminative Feature Selection (UDFS) [Yang et al., 2011] aims to select the most discriminative features for data representation, where manifold structure is also considered. Since the most discriminative information for feature selection is usually encoded in labels, it is very important to predict good cluster indicators as pseudo labels for unsupervised feature selection.

**Shapelets for clustering.** Shapelets have also been utilized to cluster time series [Zakaria et al., 2012]. Zakaria et al. [Zakaria et al., 2012] proposed a method that uses unsupervised-shapelets (u-Shapelets) for time series clustering. The algorithm searches for u-Shapelets which can separate and remove a subset of time series from the rest of the dataset, then iteratively repeats the search among the remaining data until no data remains to be separated. It is a greedy search algorithm which attempts to maximize the gap between the two groups of time series divided by a u-Shapelet. The k-Shape algorithm [Paparrizos and Gravano, 2015] is an efficient, domain-independent algorithm for shape-based time series clustering. It is based on a scalable iterative refinement procedure which creates homogeneous and well-separated clusters. Specifically, k-Shape requires a distance measure that is invariant to scaling and shifting, and it uses a normalized version of the cross-correlation measure as the distance measure to capture the shapes of time series.
Based on the normalized cross-correlation, the method computes cluster centroids in every iteration to update the assignment of time series to clusters.

Our work differs from the above research problems. We introduce a new approach for unsupervised shapelet learning that auto-learns shapelets from unlabeled time series by combining shapelet learning and unsupervised feature selection methods.

3 Preliminaries

In this paper, scalars are denoted by plain letters ($a, b, \dots; \alpha, \beta, \dots$), vectors by lower-case bold letters ($\mathbf{a}, \mathbf{b}, \dots$), and matrices by upper-case bold letters ($\mathbf{A}, \mathbf{B}, \dots$). We use $a_{(k)}$ to denote the $k$-th element of vector $\mathbf{a}$, and $A_{(ij)}$ to denote the element at the $i$-th row and $j$-th column. $\mathbf{A}_{(i,:)}$ and $\mathbf{A}_{(:,j)}$ denote the $i$-th row and the $j$-th column of the matrix, respectively. In a time series example, $t_{a,b}$ denotes a segment starting at position $a$ and ending at position $b$.

Consider a set of time series examples $T = \{t_1, t_2, \dots, t_n\}$. Each example $t_i$ ($1 \le i \le n$) contains an ordered set of real values denoted as $(t_i(1), t_i(2), \dots, t_i(q_i))$, where $q_i$ is the length of $t_i$. We wish to learn a set of top-$k$ most discriminative shapelets $S = \{s_1, s_2, \dots, s_k\}$. Similar to the shapelet learning model [Grabocka et al., 2014], we let the shapelet lengths span $r$ different scales starting at a minimum length $l_{\min}$, i.e. $\{l_{\min}, 2\,l_{\min}, \dots, r\,l_{\min}\}$. Each length scale $i\,l_{\min}$ contains $k_i$ shapelets and $k = \sum_{i=1}^{r} k_i$. Obviously, $S \in \bigcup_{i=1}^{r} \mathbb{R}^{k_i \times (i\,l_{\min})}$, and we require $r\,l_{\min} \le q_i$ to keep the shapelets compact.

4 Unsupervised shapelet learning

In this section, we formulate the Unsupervised Shapelet Learning Model (USLM), shown in Eq. (7). It combines a spectral regularization term, a shapelet similarity regularization term and a regularized least-squares minimization term. To introduce USLM, we first describe the shapelet-transformed representation [Grabocka et al., 2014] of time series, which transfers time series from the original space to a shapelet-based space. Then we introduce the pseudo-class labels and the three terms of the unsupervised shapelet learning model.

**Shapelet-transformed Representation**

Shapelet transformation was introduced by [Lines et al., 2012] to downsize a long time series into a short feature vector in the shapelet feature space. Time series are ordered sequences, and the shapelet transformation preserves the shape information for classification. Given a set of time series examples $T = \{t_1, t_2, \dots, t_n\}$ and a set of shapelets $S = \{s_1, s_2, \dots, s_k\}$, we use $\mathbf{X} \in \mathbb{R}^{k \times n}$ to denote the shapelet-transformed matrix, where each element $X_{(s_i, t_j)}$ denotes the distance between shapelet $s_i$ and time series $t_j$. For simplicity, we write $X_{(ij)}$ for $X_{(s_i, t_j)}$, which is calculated as in Eq. (1),

$$X_{(ij)} = \min_{g = 1, \dots, \bar q} \; \frac{1}{l_i} \sum_{h=1}^{l_i} \bigl( t_j(g+h-1) - s_i(h) \bigr)^2 \qquad (1)$$

where $\bar q = q_j - l_i + 1$ denotes the total number of segments of length $l_i$ in series $t_j$, and $q_j$, $l_i$ are the lengths of time series $t_j$ and shapelet $s_i$, respectively. Given a set of shapelets $S$, $X_{(ij)}$ is a function of $S$, i.e. $X(S)_{(ij)}$; for simplicity, we omit the variable $S$ and still write $X_{(ij)}$.

The distance function in Eq. (1) is not continuous and thus non-differentiable. Following [Grabocka et al., 2014], we approximate it with the soft minimum function as in Eq. (2),

$$X_{(ij)} \approx \frac{\sum_{q=1}^{\bar q} d_{ijq} \, e^{\alpha d_{ijq}}}{\sum_{q=1}^{\bar q} e^{\alpha d_{ijq}}} \qquad (2)$$

where $d_{ijq} = \frac{1}{l_i} \sum_{h=1}^{l_i} \bigl( t_j(q+h-1) - s_i(h) \bigr)^2$ and $\alpha$ controls the precision of the approximation. The soft minimum approaches the true minimum as $\alpha \to -\infty$; in our experiments, we set $\alpha = -100$.
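To make Eqs. (1) and (2) concrete, below is a minimal NumPy sketch of the soft-minimum shapelet transform. This is our own illustration rather than the released Matlab code; the helper names (`soft_min_distance`, `shapelet_transform`) and the exponent shift for numerical stability are ours. Series and shapelets are assumed to be 1-D NumPy arrays.

```python
import numpy as np

def soft_min_distance(series, shapelet, alpha=-100.0):
    """Soft-minimum distance of Eq. (2) between one series and one shapelet.

    d[q] is the mean squared distance d_{ijq} of Eq. (2); as alpha -> -inf
    the weighted average collapses onto the true minimum of Eq. (1).
    """
    l = len(shapelet)
    d = np.array([np.mean((series[q:q + l] - shapelet) ** 2)
                  for q in range(len(series) - l + 1)])
    # Shift the exponent by d.min() for numerical stability; the shift
    # cancels in the ratio below, so the returned value is unchanged.
    w = np.exp(alpha * (d - d.min()))
    return float(np.sum(d * w) / np.sum(w))

def shapelet_transform(T, S, alpha=-100.0):
    """Shapelet-transformed matrix X (k x n): X[i, j] is the soft-minimum
    distance between shapelet s_i and time series t_j, as in Eq. (2)."""
    return np.array([[soft_min_distance(t, s, alpha) for t in T] for s in S])
```

Since $\alpha < 0$ and $d_{ijq} - \min_q d_{ijq} \ge 0$, every shifted exponent is non-positive, so the weights stay in $(0, 1]$ and never overflow, even for long series.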
**Pseudo-class label**

Unsupervised learning faces the challenge of unlabeled training examples, so we introduce pseudo-class labels for learning. Suppose we cluster a time series dataset into $c$ categories. The pseudo-class label matrix $\mathbf{Y} \in \mathbb{R}^{c \times n}$ contains $c$ labels, where $Y_{(ij)}$ indicates the probability that the $j$-th time series example belongs to the $i$-th category. Time series examples that share the same pseudo-class label fall into the same category: if $Y_{(ij)} > Y_{(i'j)}$ for all $i' \ne i$, then the time series example $t_j$ belongs to cluster $i$.

**Spectral Analysis**

Spectral analysis was introduced by [Donath and Hoffman, 1973] and has been widely used in unsupervised learning [Von Luxburg, 2007]. The principle behind it is that examples that are close to each other are likely to share the same class label [Tang and Liu, 2014] [Von Luxburg, 2007]. Let $\mathbf{G} \in \mathbb{R}^{n \times n}$ be the similarity matrix of the time series based on the shapelet-transformed matrix $\mathbf{X}$; it is calculated as in Eq. (3),

$$G_{(ij)} = \exp\left( -\frac{\| \mathbf{X}_{(:,i)} - \mathbf{X}_{(:,j)} \|^2}{\sigma^2} \right) \qquad (3)$$

where $\sigma$ is the parameter of the RBF kernel. Based on $\mathbf{G}$, we expect the pseudo-class labels of similar data instances to be the same. Therefore, we can formulate a spectral regularization term as follows,

$$\frac{1}{2} \sum_{i,j=1}^{n} G_{(ij)} \| \mathbf{Y}_{(:,i)} - \mathbf{Y}_{(:,j)} \|^2 = \frac{1}{2} \sum_{k=1}^{c} \sum_{i,j=1}^{n} G_{(ij)} \bigl( Y_{(ki)} - Y_{(kj)} \bigr)^2 = \sum_{k=1}^{c} \mathbf{Y}_{(k,:)} (\mathbf{D_G} - \mathbf{G}) \mathbf{Y}_{(k,:)}^{T} = \mathrm{tr}\bigl( \mathbf{Y} \mathbf{L_G} \mathbf{Y}^{T} \bigr) \qquad (4)$$

where $\mathbf{L_G} = \mathbf{D_G} - \mathbf{G}$ is the Laplacian matrix and $\mathbf{D_G}$ is a diagonal matrix with elements $D_G(i,i) = \sum_{j=1}^{n} G_{(ij)}$.

**Least Square Minimization**

Based on the pseudo-class labels, we wish to minimize the least square error. Let $\mathbf{W} \in \mathbb{R}^{k \times c}$ be the classification boundary under the pseudo-class labels; the regularized least-squares objective is

$$\min_{\mathbf{W}} \; \| \mathbf{W}^{T} \mathbf{X} - \mathbf{Y} \|_F^2 + \lambda \| \mathbf{W} \|_F^2 \qquad (5)$$

**Shapelet Similarity Minimization**

We wish to learn shapelets that are diverse in shape. Specifically, we penalize the model when it outputs similar shapelets. Formally, we denote the shapelet similarity matrix as $\mathbf{H} \in \mathbb{R}^{k \times k}$, where each element $H_{(s_i, s_j)}$ represents the similarity between two shapelets $s_i$ and $s_j$. For simplicity, we write $H_{(ij)}$ for $H_{(s_i, s_j)}$, which is calculated as in Eq. (6),

$$H_{(ij)} = \exp\left( -\frac{\hat d_{ij}}{\sigma^2} \right) \qquad (6)$$

where $\hat d_{ij}$ is the distance between shapelet $s_i$ and shapelet $s_j$, calculated following Eq. (2).

**Unsupervised Shapelet Learning Model**

Eq. (7) gives the unsupervised shapelet learning model (USLM). It is a joint optimization problem with respect to three variables: the classification boundary $\mathbf{W}$, the pseudo-class labels $\mathbf{Y}$ and the candidate shapelets $S$,

$$\min_{S, \mathbf{W}, \mathbf{Y}} \; \frac{1}{2} \mathrm{tr}\bigl( \mathbf{Y} \mathbf{L}_{\mathbf{G}(S)} \mathbf{Y}^{T} \bigr) + \frac{\lambda_1}{2} \| \mathbf{H}(S) \|_F^2 + \frac{\lambda_2}{2} \| \mathbf{W}^{T} \mathbf{X}(S) - \mathbf{Y} \|_F^2 + \frac{\lambda_3}{2} \| \mathbf{W} \|_F^2 \qquad (7)$$

In the objective function, the first term is the spectral regularization that preserves local structure information. The second term is the shapelet similarity regularization that prefers diverse shapelets. The third and fourth terms are the regularized least-squares minimization. Note that the matrix $\mathbf{L_G}$ in Eq. (4), the matrix $\mathbf{H}$ in Eq. (6) and the matrix $\mathbf{X}$ in Eq. (2) all depend on the shapelets $S$; we therefore write them as variables of $S$ in Eq. (7), i.e. $\mathbf{L}_{\mathbf{G}(S)}$, $\mathbf{H}(S)$ and $\mathbf{X}(S)$, respectively. Below, we propose a coordinate descent algorithm to solve the model.
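Before turning to the algorithm, we note that the objective is straightforward to evaluate given the pieces above. Continuing our earlier NumPy sketch (and reusing its `shapelet_transform` and `soft_min_distance` helpers), the following illustrates Eq. (7); it is a sketch under the formulation just given and, for simplicity, assumes all shapelets share one length, so the shapelet-to-shapelet distance $\hat d_{ij}$ of Eq. (6) reduces to a single-segment soft minimum.

```python
def uslm_objective(S, T, W, Y, lam1, lam2, lam3, alpha=-100.0, sigma=1.0):
    """Value of the USLM objective in Eq. (7).

    Shapes follow Section 3: X is k x n, Y is c x n, W is k x c.
    """
    X = shapelet_transform(T, S, alpha)                        # Eq. (2)
    # RBF similarity between shapelet-transformed series, Eq. (3)
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # n x n
    G = np.exp(-sq / sigma ** 2)
    L = np.diag(G.sum(axis=1)) - G                             # Laplacian, Eq. (4)
    # Shapelet similarity matrix, Eq. (6) (equal-length shapelets assumed)
    d_hat = np.array([[soft_min_distance(sj, si, alpha) for sj in S] for si in S])
    H = np.exp(-d_hat / sigma ** 2)
    return (0.5 * np.trace(Y @ L @ Y.T)                # spectral term
            + 0.5 * lam1 * np.sum(H ** 2)              # shapelet similarity term
            + 0.5 * lam2 * np.sum((W.T @ X - Y) ** 2)  # least-squares term
            + 0.5 * lam3 * np.sum(W ** 2))             # regularizer on W
```

Driving this value down jointly over $S$, $\mathbf{W}$ and $\mathbf{Y}$ is exactly the coordinate descent problem addressed next.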
5 The Algorithm

In this part, we first introduce a coordinate descent algorithm to solve the USLM, and then analyze its convergence and initialization.

5.1 Learning Algorithm

In the coordinate descent algorithm, we iteratively update one variable while fixing the remaining two. The steps are repeated until convergence. Algorithm 1 summarizes the procedure.

Algorithm 1: Unsupervised Shapelet Learning Algorithm (USLA)

1: Input: time series $T$ with $c$ classes; length and number of shapelets $l_{\min}$, $r$, $k$; number of internal iterations $i_{\max}$; learning rate $\eta$; parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, $\alpha$ and $\sigma$
2: Output: shapelets $S$ and class labels $\mathbf{Y}$
3: Initialize $S_0$, $\mathbf{W}_0$, $\mathbf{Y}_0$
4: while not converged do
5:   Calculate $\mathbf{X}_t(T, S_t \mid \alpha)$, $\mathbf{L}_{\mathbf{G}_t}(T, S_t \mid \alpha, \sigma)$ and $\mathbf{H}_t(S_t \mid \alpha)$ based on Eqs. (2), (4) and (6)
6:   Update $\mathbf{Y}_{t+1} \leftarrow \lambda_2 \mathbf{W}_t^{T} \mathbf{X}_t (\mathbf{L}_{\mathbf{G}_t} + \lambda_2 \mathbf{I})^{-1}$
7:   Update $\mathbf{W}_{t+1} \leftarrow (\lambda_2 \mathbf{X}_t \mathbf{X}_t^{T} + \lambda_3 \mathbf{I})^{-1} (\lambda_2 \mathbf{X}_t \mathbf{Y}_{t+1}^{T})$
8:   for $i = 1, \dots, i_{\max}$ do
9:     $S_{i+1} \leftarrow S_i - \eta \nabla_{S_i}$, where $\nabla_{S_i} = \partial F(S_i \mid \mathbf{Y}_{t+1}, \mathbf{W}_{t+1}) / \partial S$ follows Eq. (17)
10:  end for
11:  $S_{t+1} \leftarrow S_{i_{\max}+1}$; $t \leftarrow t + 1$
12: end while
13: Output $S = S_t$, $\mathbf{Y} = \mathbf{Y}_t$, $\mathbf{W} = \mathbf{W}_t$

**Update Y by fixing W and S.** By fixing $\mathbf{W}$ and $S$, the objective in Eq. (7) degenerates to Eq. (8),

$$\min_{\mathbf{Y}} F(\mathbf{Y}) = \frac{1}{2} \mathrm{tr}\bigl( \mathbf{Y} \mathbf{L_G} \mathbf{Y}^{T} \bigr) + \frac{\lambda_2}{2} \| \mathbf{W}^{T} \mathbf{X} - \mathbf{Y} \|_F^2 \qquad (8)$$

The derivative of Eq. (8) with respect to $\mathbf{Y}$ is

$$\frac{\partial F}{\partial \mathbf{Y}} = \mathbf{Y} (\mathbf{L_G} + \lambda_2 \mathbf{I}) - \lambda_2 \mathbf{W}^{T} \mathbf{X} \qquad (9)$$

Setting Eq. (9) to zero, we obtain the solution of $\mathbf{Y}$ as in Eq. (10),

$$\mathbf{Y} = \lambda_2 \mathbf{W}^{T} \mathbf{X} (\mathbf{L_G} + \lambda_2 \mathbf{I})^{-1} \qquad (10)$$

where $\mathbf{I}$ is an identity matrix. Thus, the update of $\mathbf{Y}$ is

$$\mathbf{Y}_{t+1} = \lambda_2 \mathbf{W}_t^{T} \mathbf{X}_t (\mathbf{L}_{\mathbf{G}_t} + \lambda_2 \mathbf{I})^{-1} \qquad (11)$$

**Update W by fixing S and Y.** By fixing $S$ and $\mathbf{Y}$, Eq. (7) degenerates to Eq. (12),

$$\min_{\mathbf{W}} F(\mathbf{W}) = \frac{\lambda_2}{2} \| \mathbf{W}^{T} \mathbf{X} - \mathbf{Y} \|_F^2 + \frac{\lambda_3}{2} \| \mathbf{W} \|_F^2 \qquad (12)$$

The derivative of Eq. (12) with respect to $\mathbf{W}$ is

$$\frac{\partial F}{\partial \mathbf{W}} = (\lambda_2 \mathbf{X} \mathbf{X}^{T} + \lambda_3 \mathbf{I}) \mathbf{W} - \lambda_2 \mathbf{X} \mathbf{Y}^{T} \qquad (13)$$

Setting Eq. (13) to zero, we obtain the solution of $\mathbf{W}$ as in Eq. (14),

$$\mathbf{W} = (\lambda_2 \mathbf{X} \mathbf{X}^{T} + \lambda_3 \mathbf{I})^{-1} (\lambda_2 \mathbf{X} \mathbf{Y}^{T}) \qquad (14)$$

Thus, the update of $\mathbf{W}$ is

$$\mathbf{W}_{t+1} = (\lambda_2 \mathbf{X}_t \mathbf{X}_t^{T} + \lambda_3 \mathbf{I})^{-1} (\lambda_2 \mathbf{X}_t \mathbf{Y}_{t+1}^{T}) \qquad (15)$$

**Update S by fixing W and Y.** By fixing $\mathbf{W}$ and $\mathbf{Y}$, Eq. (7) degenerates to Eq. (16),

$$\min_{S} F(S) = \frac{1}{2} \mathrm{tr}\bigl( \mathbf{Y} \mathbf{L}_{\mathbf{G}(S)} \mathbf{Y}^{T} \bigr) + \frac{\lambda_1}{2} \| \mathbf{H}(S) \|_F^2 + \frac{\lambda_2}{2} \| \mathbf{W}^{T} \mathbf{X}(S) - \mathbf{Y} \|_F^2 \qquad (16)$$

Eq. (16) is non-convex, and we cannot solve for $S$ in closed form as we did for $\mathbf{W}$ and $\mathbf{Y}$. Instead, we resort to an iterative algorithm with learning rate $\eta$, i.e. $S_{i+1} = S_i - \eta \nabla_{S_i}$, where $\nabla_{S_i} = \partial F(S_i) / \partial S$. The iterative steps guarantee the descent of the objective function. The derivative of Eq. (16) with respect to $S_{(mp)}$ is

$$\frac{\partial F(S)}{\partial S_{(mp)}} = \frac{1}{2} \mathrm{tr}\left( \mathbf{Y}^{T} \mathbf{Y} \frac{\partial \mathbf{L}_{\mathbf{G}(S)}}{\partial S_{(mp)}} \right) + \lambda_1 \mathrm{tr}\left( \mathbf{H}(S)^{T} \frac{\partial \mathbf{H}(S)}{\partial S_{(mp)}} \right) + \lambda_2 \mathrm{tr}\left( \bigl( \mathbf{W} (\mathbf{W}^{T} \mathbf{X} - \mathbf{Y}) \bigr)^{T} \frac{\partial \mathbf{X}(S)}{\partial S_{(mp)}} \right) \qquad (17)$$

where $m = 1, \dots, k$ and $p = 1, \dots, l_m$. Because $\mathbf{L_G} = \mathbf{D_G} - \mathbf{G}$ and $D_G(i,i) = \sum_{j=1}^{n} G_{(ij)}$, the first term reduces to calculating $\partial G_{(ij)} / \partial S_{(mp)}$, as shown in Eq. (18),

$$\frac{\partial G_{(ij)}}{\partial S_{(mp)}} = -\frac{2}{\sigma^2} G_{(ij)} \sum_{q=1}^{k} \bigl( X_{(qi)} - X_{(qj)} \bigr) \left( \frac{\partial X_{(qi)}}{\partial S_{(mp)}} - \frac{\partial X_{(qj)}}{\partial S_{(mp)}} \right) \qquad (18)$$

and, from the quotient rule applied to Eq. (2),

$$\frac{\partial X_{(ij)}}{\partial S_{(mp)}} = \sum_{q=1}^{\bar q_{ij}} \frac{e^{\alpha d_{ijq}} \bigl( (1 + \alpha d_{ijq}) E_1 - \alpha E_2 \bigr)}{E_1^2} \, \frac{\partial d_{ijq}}{\partial S_{(mp)}} \qquad (19)$$

where $E_1 = \sum_{q=1}^{\bar q_{ij}} e^{\alpha d_{ijq}}$, $E_2 = \sum_{q=1}^{\bar q_{ij}} d_{ijq} e^{\alpha d_{ijq}}$, $\bar q_{ij} = q_j - l_i + 1$ and $d_{ijq} = \frac{1}{l_i} \sum_{h=1}^{l_i} \bigl( t_j(q+h-1) - s_i(h) \bigr)^2$. Moreover,

$$\frac{\partial d_{ijq}}{\partial S_{(mp)}} = \begin{cases} 0 & \text{if } i \ne m \\ \frac{2}{l_m} \bigl( S_{(mp)} - T_{j, q+p-1} \bigr) & \text{if } i = m \end{cases} \qquad (20)$$

The second term in Eq. (17) reduces to calculating Eq. (21),

$$\frac{\partial H_{(ij)}}{\partial S_{(mp)}} = -\frac{1}{\sigma^2} H_{(ij)} \frac{\partial \hat d_{ij}}{\partial S_{(mp)}} \qquad (21)$$

where $\hat d_{ij}$ is the distance between shapelets $s_i$ and $s_j$; $\hat d_{ij}$ and $\partial \hat d_{ij} / \partial S_{(mp)}$ are calculated analogously to $X_{(ij)}$ and $\partial X_{(ij)} / \partial S_{(mp)}$, respectively. To sum up, we can calculate the gradient $\nabla_{S_i} = \partial F(S) / \partial S_i$ by following Eqs. (17)-(21). In the following, we discuss the convergence of the coordinate descent algorithm in solving Eq. (7).

5.2 Convergence

The convergence of Algorithm 1 depends on stepwise descents. When updating $\mathbf{Y}$ or $\mathbf{W}$, the closed-form solutions $\mathbf{Y}_{t+1} = \mathbf{Y}^{*}(\mathbf{W}_t, S_t)$ and $\mathbf{W}_{t+1} = \mathbf{W}^{*}(\mathbf{Y}_{t+1}, S_t)$ minimize the corresponding subproblems exactly. When updating $S$, the objective in Eq. (16) is not convex and a closed-form solution is difficult to obtain, so we use gradient descent; as long as the learning rate $\eta$ is set appropriately (usually very small), the objective function decreases until convergence. In addition, the objective in Eq. (7) is non-convex but bounded below by 0 due to being non-negative, so Algorithm 1 converges to a local optimum. In our experiments, we run the algorithm several times under different initializations and choose the best solution as the output.
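To make the two exact coordinate steps concrete, here is a small NumPy sketch of Eqs. (11) and (15), continuing our earlier sketches. It is our own illustration, not the released code; `np.linalg.solve` replaces the explicit matrix inverses for numerical stability.

```python
def closed_form_updates(X, L_G, W, lam2, lam3):
    """Closed-form coordinate updates of Algorithm 1.

    X: k x n shapelet-transformed matrix, L_G: n x n Laplacian,
    W: current k x c classification boundary. Returns (Y, W_new).
    """
    k, n = X.shape
    # Eq. (11): Y = lam2 * W^T X (L_G + lam2 I)^{-1}.
    # Solve Y (L_G + lam2 I) = lam2 W^T X by transposing both sides.
    A = L_G + lam2 * np.eye(n)
    Y = np.linalg.solve(A.T, (lam2 * W.T @ X).T).T
    # Eq. (15): W = (lam2 X X^T + lam3 I)^{-1} (lam2 X Y^T)
    W_new = np.linalg.solve(lam2 * X @ X.T + lam3 * np.eye(k), lam2 * X @ Y.T)
    return Y, W_new
```

The $S$-step has no such closed form; it applies $i_{\max}$ plain gradient steps, with the gradient assembled from Eqs. (17)-(21).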
5.3 Initialization

Because the algorithm only outputs local optima, we discuss how to initialize the variables to improve its performance. The algorithm expects initializations of $S_0$, $\mathbf{Y}_0$ and $\mathbf{W}_0$. We first initialize $S_0$ with the centroids of the segments that have the same lengths as the shapelets, because centroids represent typical patterns behind the data. Then, we compute the shapelet-transformed matrix $\mathbf{X}_0$ of the time series based on $S_0$. Next, the result of k-means clustering on $\mathbf{X}_0$ is used to initialize $\mathbf{W}_0$ and $\mathbf{Y}_0$. This initialization enables fast convergence.

6 Experiments

We conduct experiments to validate the performance of USLM. All experiments are conducted on a Windows 8 machine with a 3.00 GHz CPU and 8 GB memory. The Matlab source codes and data are available online¹.

¹ https://github.com/BlindReview/shapelet

6.1 Datasets

Synthetic data: This dataset is generated following [Shariat and Pavlovic, 2011] and [Zakaria et al., 2012]. It consists of ten examples from two classes. The first class contains sinusoidal signals of length 200; the second class contains rectangular signals of length 100. We randomly embed Gaussian noise in the data. Heterogeneous noise is generated from five Gaussian processes with means $\mu \in [0, 2]$ and variances $\sigma^2 \in [0, 10]$ chosen uniformly at random.

Real-world data: We use seven time series benchmark datasets downloaded from the UCR time series archive [Chen et al., 2015] [Cetin et al., 2015]. The datasets are summarized in Table 1; more details are available on the archive website.

Table 1: Statistics of the benchmark time series datasets

| Dataset | Train/Test | Length | # classes |
|---|---|---|---|
| CBF | 30/900 (930) | 128 | 3 |
| ECG200 | 100/100 (200) | 96 | 2 |
| Face Four | 24/88 (112) | 350 | 4 |
| Ita.Pow.Dem. | 67/1029 (1096) | 24 | 2 |
| Lighting2 | 60/61 (121) | 637 | 2 |
| Lighting7 | 70/73 (143) | 319 | 7 |
| OSU Leaf | 200/242 (442) | 427 | 6 |

6.2 Measures

Existing measures for evaluating time series clustering include the Jaccard Score, the Rand Index, and the Folkes and Mallow index [Halkidi et al., 2001] [Zakaria et al., 2012]. Among them, the Rand index [Rand, 1971] is the most popular, and the remaining measures can be taken as variants of it. Therefore, we use the Rand index as the evaluation measure. To calculate it, we compare the cluster labels $Y$ obtained by the clustering algorithm with the genuine class labels $L_{\text{true}}$ as in Eq. (22),

$$\text{Rand index} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (22)$$

where $TP$ is the number of time series pairs which belong to the same class in $L_{\text{true}}$ and are assigned to the same cluster in $Y$, $TN$ is the number of pairs which belong to different classes in $L_{\text{true}}$ and are assigned to different clusters in $Y$, $FP$ is the number of pairs which belong to different classes in $L_{\text{true}}$ but are assigned to the same cluster in $Y$, and $FN$ is the number of pairs which belong to the same class in $L_{\text{true}}$ but are assigned to different clusters in $Y$. A Rand index close to 1 indicates a high-quality clustering [Zakaria et al., 2012] [Paparrizos and Gravano, 2015].
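As a reference point for Eq. (22), the following is a straightforward $O(n^2)$ sketch of the Rand index; the name `rand_index` is ours, and the pair counting follows the TP/TN/FP/FN definitions above.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand index of Eq. (22): the fraction of example pairs on which the
    clustering and the ground truth agree (TP and TN both count as agreement)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum((labels_true[i] == labels_true[j]) ==
                (labels_pred[i] == labels_pred[j])
                for i, j in pairs)
    return agree / len(pairs)

# A perfect clustering scores 1.0 even under label permutation, e.g.
# rand_index([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
```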
6.3 Time series with unequal lengths

Figure 2: An example of the rectangular signals (short blue thin curves) and the sinusoidal signals (long black thin curves). The two learned shapelets are marked with bold red curves.

Fig. 2 shows an example of two shapelets learnt by USLM. Shapelet 1 is a sharp spike of length 12 that matches the most prominent spikes of the rectangular signals. Shapelet 2 is a subsequence of length 30 that closely resembles the standard shape of the sinusoidal signals. From the results, we observe that USLM can auto-learn representative shapelets from unlabeled time series data. The results in Fig. 2 also show that USLM can handle time series and shapelets of unequal lengths, where we obtain the best possible Rand index value of 1. In contrast, the approach of [Shariat and Pavlovic, 2011] obtains only 0.9 on this dataset, even though it uses class label information during training.

6.4 Running time

We test the running time of USLM by varying the number of shapelets $k$ and the number of clusters $c$. The results are given in Fig. 3. First, we vary $k$ from 2 to 12 with a step size of 2. The remaining parameters are fixed as follows: $\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = 1$, $\sigma = 1$, $i_{\max} = 50$, $\eta = 0.01$, and the length of the shapelets is set to 10. The reported running time is the average of ten executions. From Fig. 3(a), we observe that the running time generally increases linearly with the number of shapelets; thus, our approach scales well to large datasets. Then, we let the number of clusters $c$ vary from 2 to 7. The length of the shapelets is set to 10% of the time series length, and we set $k = 2$, i.e., we learn only two shapelets of equal length. The remaining parameters are the same as above. Fig. 3(b) shows the results.

Figure 3: USLM running time w.r.t. (a) the number of shapelets k and (b) the number of clusters c. The running time increases linearly w.r.t. the number of shapelets and remains stable w.r.t. the number of clusters c. Thus, USLM scales well to large datasets.

6.5 Comparisons with k-Shape

The k-Shape algorithm [Paparrizos and Gravano, 2015] clusters unlabeled time series. In this part, we compare our algorithm with k-Shape, using the source codes downloaded from the website of the original authors. Table 2 lists the results; for k-Shape, we select the best results obtained over 100 repeats.

Table 2: Comparison between k-Shape and USLM (Rand index).

| Dataset | k-Shape | USLM |
|---|---|---|
| CBF | 0.74 | 1.00 |
| ECG200 | 0.70 | 0.76 |
| Face Four | 0.64 | 0.79 |
| Ita.Pow.Dem. | 0.70 | 0.82 |
| Lighting2 | 0.65 | 0.80 |
| Lighting7 | 0.74 | 0.79 |
| OSU Leaf | 0.66 | 0.82 |
| Average | 0.69 | 0.83 |

From Table 2, we observe that USLM outperforms the k-Shape algorithm on all seven datasets. The average improvement over the seven datasets is 14%, and the best improvement is 26%, on the CBF dataset. To sum up, the results show that USLM attains higher accuracy than k-Shape.

7 Conclusions

In this paper, we investigated a new problem of feature learning from unlabeled time series data. To solve the problem, we proposed a new learning model, USLM, which combines pseudo-class labels, spectral analysis, shapelet regularization and regularized least-squares minimization. USLM can auto-learn the most discriminative features from unlabeled time series data. Experiments on real-world time series data show that USLM obtains an average improvement of 14% over k-Shape.

Acknowledgments

This work was supported by the Australian Research Council (ARC) Discovery Projects under Grant Nos.
DP140100545 and DP140102206, and was also partially supported by grants from the National Natural Science Foundation of China (Nos. 61472390 and 11271361). Y. Tian is the corresponding author.

References

[Cai and Ng, 2004] Yuhan Cai and Raymond Ng. Indexing spatio-temporal trajectories with Chebyshev polynomials. In SIGMOD, pages 599-610, 2004.

[Cai et al., 2010] Deng Cai, Chiyuan Zhang, and Xiaofei He. Unsupervised feature selection for multi-cluster data. In KDD, pages 333-342, 2010.

[Cetin et al., 2015] Mustafa S. Cetin, Abdullah Mueen, and Vince D. Calhoun. Shapelet ensemble for multi-dimensional time series. In SDM, pages 307-315, 2015.

[Chang et al., 2012] Kai-Wei Chang, Bikash Deka, Wen-Mei W. Hwu, and Dan Roth. Efficient pattern-based time series classification on GPU. In ICDM, pages 131-140, 2012.

[Chen et al., 2015] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The UCR time series classification archive, www.cs.ucr.edu/~eamonn/time_series_data/, 2015.

[Donath and Hoffman, 1973] William E. Donath and Alan J. Hoffman. Lower bounds for the partitioning of graphs. IBM Journal of Research and Development, 17(5):420-425, 1973.

[Grabocka et al., 2014] Josif Grabocka, Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Learning time-series shapelets. In KDD, pages 392-401, 2014.

[Halkidi et al., 2001] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2):107-145, 2001.

[He et al., 2005] Xiaofei He, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. In NIPS, pages 507-514, 2005.

[He et al., 2012] Qing He, Zhi Dong, Fuzhen Zhuang, Tianfeng Shang, and Zhongzhi Shi. Fast time series classification based on infrequent shapelets. In ICMLA, pages 215-219, 2012.

[Hills et al., 2014] Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery, 28(4):851-881, 2014.

[Hirano and Tsumoto, 2006] Shoji Hirano and Shusaku Tsumoto. Cluster analysis of time-series medical data based on the trajectory representation and multiscale comparison techniques. In ICDM, pages 896-901, 2006.

[Lines et al., 2012] Jason Lines, Luke M. Davis, Jon Hills, and Anthony Bagnall. A shapelet transform for time series classification. In KDD, pages 289-297, 2012.

[Mueen et al., 2011] Abdullah Mueen, Eamonn Keogh, and Neal Young. Logical-shapelets: an expressive primitive for time series classification. In KDD, pages 1154-1162, 2011.

[Paparrizos and Gravano, 2015] John Paparrizos and Luis Gravano. k-Shape: Efficient and accurate clustering of time series. In SIGMOD, pages 1855-1870, 2015.

[Rakthanmanon and Keogh, 2013] Thanawin Rakthanmanon and Eamonn Keogh. Fast shapelets: A scalable algorithm for discovering time series shapelets. In SDM, pages 668-676, 2013.

[Rand, 1971] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846-850, 1971.

[Ruiz et al., 2012] Eduardo J. Ruiz, Vagelis Hristidis, Carlos Castillo, Aristides Gionis, and Alejandro Jaimes. Correlating financial time series with micro-blogging activity. In WSDM, pages 513-522, 2012.

[Shariat and Pavlovic, 2011] Shahriar Shariat and Vladimir Pavlovic. Isotonic CCA for sequence alignment and activity recognition. In ICCV, pages 2572-2578, 2011.

[Tang and Liu, 2014] Jiliang Tang and Huan Liu.
An unsupervised feature selection framework for social media data. IEEE Transactions on Knowledge and Data Engineering, 26(12):2914-2927, 2014.

[Von Luxburg, 2007] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

[Wistuba et al., 2015] Martin Wistuba, Josif Grabocka, and Lars Schmidt-Thieme. Ultra-fast shapelets for time series classification. Journal of Data and Knowledge Engineering, 2015.

[Yang et al., 2011] Yi Yang, Heng Tao Shen, Zhigang Ma, Zi Huang, and Xiaofang Zhou. l2,1-norm regularized discriminative feature selection for unsupervised learning. In IJCAI, pages 1589-1594, 2011.

[Ye and Keogh, 2009] Lexiang Ye and Eamonn Keogh. Time series shapelets: a new primitive for data mining. In KDD, pages 947-956, 2009.

[Zakaria et al., 2012] Jesin Zakaria, Abdullah Mueen, and Eamonn Keogh. Clustering time series using unsupervised-shapelets. In ICDM, pages 785-794, 2012.

[Zhang et al., 2015] Peng Zhang, Chuan Zhou, Peng Wang, Byron J. Gao, Xingquan Zhu, and Li Guo. E-tree: An efficient indexing structure for ensemble models on data streams. IEEE Transactions on Knowledge and Data Engineering, 27(2):461-474, 2015.

[Zhao and Liu, 2007] Zheng Zhao and Huan Liu. Spectral feature selection for supervised and unsupervised learning. In ICML, pages 1151-1157, 2007.

[Zhao et al., 2010] Zheng Zhao, Lei Wang, and Huan Liu. Efficient spectral feature selection with minimum redundancy. In AAAI, 2010.