Hubness Change Point Detection

Ikumi Suzuki¹, Kazuo Hara¹, Eiji Murakami²
¹Yamagata University  ²Azbil Kimmon Co., Ltd., Keio University
suzuki.ikumi@gmail.com, kazuo.hara@gmail.com, e.murakami.y8@azbil.com

Abstract

This study proposes a new change detection method that leverages hubness. Hubness is a phenomenon that occurs in high-dimensional spaces, where certain special data points, known as hub data, tend to be close to many other data points. Hubness is known to degrade the accuracy of methods based on nearest neighbor search, and many past studies have therefore focused on reducing hubness to improve accuracy. In contrast, this study utilizes hubness to detect changes. Specifically, consider the two datasets obtained by dividing time-series data at a candidate point. If there is no change, suppressing the hubness occurring between the two datasets results in a uniform data distribution. However, if there is a change, then even if we try to reduce the hubness between the two datasets obtained before and after the change, the hubness will not be reduced and the data distribution will not become uniform. We use this finding to detect changes. Experiments with synthetic data show that the proposed method achieves accuracy comparable to or exceeding that of existing methods. Additionally, the proposed method achieves good accuracy on real-world data from hydraulic systems and gas sensors, along with excellent runtime performance.

Introduction

This study proposes a novel change detection method. The changes targeted for detection are those that occur in time-series data obtained by monitoring the state of machines. Examples of such time-series data include data from sensors monitoring machines operating in factories (Yi, Huang, and Li 2017; Lei et al. 2018) and data from meters measuring household water or gas usage (Gungor et al. 2011).
These data typically change due to intentional state changes, such as switching operating modes. However, they can also change due to unintentional state changes caused by machine failures. Missing these unintentional state changes can threaten the safety and reliability of the machines. Furthermore, it is desirable to perform change detection through edge computing on-site, without sending data to a cloud data center, and to trigger alarms as quickly as possible when changes occur. Therefore, developing a change detection algorithm that does not require extensive computation time is a crucial task.

Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Existing change detection methods can be broadly classified into two categories: those that assume probabilistic models and those that are model-free. Methods based on probabilistic models first use past data to estimate a probabilistic model of data generation. When data that is highly unlikely under the estimated model appears consecutively, a change is considered to have occurred. Model-free methods, on the other hand, divide the time-series data into two groups before and after each point in time; if a significant difference between the two groups is observed, a change is deemed to have occurred at that point.

The new change detection method we propose is model-free. We focus on the hubness phenomenon that occurs between the two datasets obtained by dividing the time-series data. When entries from one dataset are used as queries for nearest neighbor search over the entries of the other dataset, certain entries of the other dataset frequently appear in the search results for many queries; this is the hubness phenomenon. To manage this hubness, we apply hubness reduction methods, specifically centering and vector length normalization.
Importantly, if no change occurs and, as a result, there is no difference between the two datasets, hubness reduction will succeed, resulting in a uniform data distribution. Conversely, if a change occurs and there is a difference between the two datasets, hubness will not be reduced, and the data distribution will not become uniform.

Our contributions are as follows.
- We propose a novel method that applies the concept of hubness to change detection.
- We introduce a novel method for hubness suppression in detecting differences between two datasets and provide analytical proofs to support its effectiveness.
- We propose employing a uniformity test to measure hubness phenomena between two datasets, which advances effective change point detection.
- We develop the Hubness Change Point Detection (Hub-CPD) algorithm and present comparative experiments against state-of-the-art approaches. Due to the absence of complex computations, our method achieves faster computation times and accuracy that is either comparable to or superior to existing methods.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Related Work

There is a large body of research on change detection; see, for example, the survey by (Aminikhanghahi and Cook 2017). The objectives of past research on change detection can be divided into two main types: segmentation of time-series data, which divides the given time-series data into contiguous intervals containing no changes; and change detection in time-series data, which sounds an alarm when a change occurs in the incoming data. This study falls under the latter type, which we therefore describe in detail below. Additionally, there is an intermediate line of research that aims to predict the distribution of the number of time points with no changes since the most recent change, known as the run length distribution, but these studies do not aim to sound an alarm (Adams and MacKay 2007).
A traditional method for sounding an alarm in response to changes in time-series data is the CUSUM control chart (Page 1954). This method assumes a probabilistic model, namely a Gaussian distribution, for data generation under normal conditions, and it triggers an alarm when the cumulative sum of scores representing deviations from the normal state at each point exceeds a threshold. Following CUSUM, change detection methods that assume data generation models have been developed, including autoregressive models (Yamanishi and Takeuchi 2002), state-space models (Kawahara, Yairi, and Machida 2007), and spatio-temporal models (Alanqary, Alomar, and Shah 2021). In contrast, model-free change detection methods that do not require assumptions about the data generation model have also been developed. These methods divide the recent time-series data into two subsequences and trigger an alarm when a difference between them is detected, indicating that a change has occurred. Methods for identifying differences between the two subsequences include those based on the density ratio (Liu et al. 2013), the maximum mean discrepancy (MMD) (Chang et al. 2019), and random partitioning (Hooi and Faloutsos 2019). The method proposed in this study is a model-free change detection approach that leverages hubness to identify differences between two subsequences. Note that the density-ratio-based method requires matrix inversions and the MMD-based method uses neural networks, both of which incur high computational costs for change detection.

Hubness Suppression in a Single Dataset

Hubness is the phenomenon where special data points, those that are particularly close to many other data points, emerge in a dataset (Radovanović, Nanopoulos, and Ivanović 2010a,b; Radovanović 2015). Formally, let D be a dataset consisting of d-dimensional vectors and let N_k(x) be the count of how many times x ∈ D appears in the k-nearest neighbors of q ∈ D \ {x}.
The presence of a right-skewed tail in the distribution of N_k(x) is referred to as the occurrence of hubness, and the data points with large values of N_k(x) are termed hub data. Hubness is known to occur when there is a gradient in data density, and it becomes particularly prominent in high-dimensional spaces (Low et al. 2013; Radovanović 2015; Hara et al. 2016). When measuring the proximity between data points using the (squared) Euclidean distance dist(x, q) = ‖x − q‖², hub data tends to appear in areas where the data density is higher than in the surrounding regions. For instance, in a dataset generated from a Gaussian distribution, data close to the center of the dataset becomes hub data; in a dataset with multiple clusters, data close to the cluster centers becomes hub data. When hubness occurs, it is known to degrade the accuracy of tasks based on nearest neighbor search, such as classification and recommendation. To address this issue, methods for suppressing hubness have been developed (Schnitzer et al. 2012; Suzuki et al. 2012, 2013; Hara et al. 2015a,b, 2016; Flexer 2016). The centering methods modify the metric for measuring proximity between data points to prevent the emergence of hub data. This modification eliminates areas where the data density is higher than in the surroundings, leading to a more uniform data density. Following (Hara et al. 2016), the formal definition of the centered squared distance is given below.

Definition 1 (Centering the squared distance). Given x, q ∈ D, let c = (1/|D|) Σ_{x∈D} x be the centroid of the dataset D. The dissimilarity between data points x and q is defined as

    dissim(x, q) = ‖x − q‖² − ‖x − c‖² − ‖q − c‖².    (1)

Using Eq. (1) to measure the proximity between data points, the data density becomes more uniform, suppressing hubness (Hara et al. 2016). In the next section, we propose how to suppress hubs between two datasets.
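To illustrate, the following sketch (our own code, not from the paper; the sizes, seed, and function names are ours) compares the k-occurrence counts N_k(x) under the plain squared Euclidean distance and under the centered dissimilarity of Eq. (1) on high-dimensional Gaussian data:

```python
import numpy as np

def nk_counts(dissim, k):
    """N_k(x): how often each point falls in the k-nearest-neighbor
    lists of the other points (self excluded)."""
    n = dissim.shape[0]
    counts = np.zeros(n, dtype=int)
    for q in range(n):
        neighbors = [i for i in np.argsort(dissim[q]) if i != q][:k]
        counts[neighbors] += 1
    return counts

def centered_sq_dist(X):
    """dissim(x, q) = ||x - q||^2 - ||x - c||^2 - ||q - c||^2  (Eq. 1)."""
    c = X.mean(axis=0)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    to_c = ((X - c) ** 2).sum(axis=-1)
    return sq - to_c[:, None] - to_c[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                   # high-dimensional Gaussian data
plain = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
N_plain = nk_counts(plain, k=10)                 # hubs: points near the centroid
N_cent = nk_counts(centered_sq_dist(X), k=10)    # counts spread more evenly
```

Under the plain distance the counts tend to concentrate on points near the centroid; with the centering of Eq. (1) the N_k distribution typically becomes flatter (cf. Hara et al. 2016).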
Hubness Suppression Between Two Datasets

Techniques for suppressing hubness in a single dataset already exist, but extending them to address hubness arising between two datasets has not been explored. In this section, we introduce the concept of hubness that arises between two distinct datasets consisting of d-dimensional vectors, and we propose a method to suppress it. The change point detection algorithm we propose in the next section leverages the approach described here.

Previous studies on hubness assumed that the database sample x and the query sample q belong to the same dataset D. Here, we introduce a new scenario where x belongs to the dataset D_x and q belongs to the dataset D_q, assuming that these are two distinct datasets. Under this scenario, let kNN(q) denote the set of x ∈ D_x that fall within the k-nearest neighbors of q ∈ D_q. We define 1(x ∈ kNN(q)) = 1 if x is within the k-nearest neighbors of q, and 1(x ∈ kNN(q)) = 0 otherwise. In this context, let N_k(x) = Σ_{q∈D_q} 1(x ∈ kNN(q)). We refer to the phenomenon where N_k(x) becomes particularly large for some data x as hubness, and such data points are called hub data.

When hubness occurs, hub data x is more likely to be included in the k-nearest neighbors of data q. In other words, on average, the dissimilarity between data x and data q is small, or the similarity is large. Therefore, hubness is considered to be suppressed when, on average, the dissimilarity or similarity between any two data points a, b ∈ D_x and q ∈ D_q becomes equal. We thus define the following metric for measuring the proximity between data points to suppress hubness:

Definition 2 (Centering the squared distance for hubness suppression between two datasets). Given x ∈ D_x and q ∈ D_q, let c_x = (1/|D_x|) Σ_{x∈D_x} x and c_q = (1/|D_q|) Σ_{q∈D_q} q be the centroids of the datasets D_x and D_q, respectively. We define the dissimilarity between data points x and q as

    dissim(x, q) = ‖x − q‖² − ‖x − c_x‖² − ‖q − c_q‖² + ‖c_x − c_q‖².
    (2)

Note that if D_x and D_q are the same dataset, then c_x = c_q, and hence Eq. (2) coincides with Eq. (1). Importantly, Eq. (2) simplifies through a straightforward mathematical transformation:

    dissim(x, q) = ⟨x − q, x − q⟩ − ⟨x − c_x, x − c_x⟩ − ⟨q − c_q, q − c_q⟩ + ⟨c_x − c_q, c_x − c_q⟩
                 = −2⟨x, q⟩ + 2⟨x, c_x⟩ + 2⟨q, c_q⟩ − 2⟨c_x, c_q⟩
                 = −2⟨x − c_q, q − c_x⟩.

Furthermore, by reversing the sign and dividing by 2, we convert it into the similarity measure shown in the following definition:

Definition 3 (Centering the inner product similarity for hubness suppression between two datasets). Given x ∈ D_x and q ∈ D_q, let c_x = (1/|D_x|) Σ_{x∈D_x} x and c_q = (1/|D_q|) Σ_{q∈D_q} q be the centroids of the datasets D_x and D_q. We define the similarity between data points x and q as

    sim(x, q) = ⟨x − c_q, q − c_x⟩.    (3)

Eq. (3) calculates the similarity between x and q by transforming them into difference vectors with respect to the centroids of the datasets D_q and D_x, to which they do not belong, and then computing the inner product of these difference vectors. We show in Proposition 1 that using the similarity in Eq. (3) suppresses hubness; that is, the similarities between q ∈ D_q and any two data points a, b ∈ D_x become equal on average.

Proposition 1. Under Eq. (3), if D_x and D_q are datasets generated from the same distribution, hubness is suppressed. Conversely, if D_x and D_q are datasets generated from different distributions, there is no guarantee that hubness will be suppressed.

Proof. Let x ∈ D_x and q ∈ D_q be generated according to the probability distributions P(x) and P(q), respectively. We denote the means of P(x) and P(q) by µ_x and µ_q, and the sample means of the datasets D_x and D_q by c_x and c_q, respectively. For two distinct data points a, b ∈ D_x, when the similarity with q ∈ D_q is calculated using the standard inner product instead of the similarity defined in Eq. (3), the difference between them is

    Δ = ⟨a, q⟩ − ⟨b, q⟩ = ⟨a − b, q⟩.

The difference Δ is a function of q.
Since q is generated according to the probability distribution P(q), the expectation of the difference is

    E_q[Δ] = E_q[⟨a − b, q⟩] = ⟨a − b, µ_q⟩.

This shows that if µ_q is not the zero vector, the difference is, on average, not zero. In other words, either a or b could become hub data, as its similarity with q ∈ D_q tends to be larger on average.

We now show that when using the similarity defined in Eq. (3) instead of the standard inner product, hubness is suppressed if D_x and D_q are generated according to the same probability distribution. Let Δ_cent denote the difference in similarity between q ∈ D_q and a, b ∈ D_x when computed using Eq. (3). Then Δ_cent is given by

    Δ_cent = ⟨a − c_q, q − c_x⟩ − ⟨b − c_q, q − c_x⟩
           = ⟨a − b, q − c_x⟩
           = ⟨a − b, q⟩ − ⟨a − b, c_x⟩
           = Δ − ⟨a − b, c_x⟩.

The expectation of Δ_cent is then

    E_q[Δ_cent] = E_q[Δ] − ⟨a − b, c_x⟩ = ⟨a − b, µ_q⟩ − ⟨a − b, c_x⟩ = ⟨a − b, µ_q − c_x⟩.

If the datasets D_x and D_q are generated from the same distribution, then the sample mean c_x of D_x and the mean µ_q of P(q) are approximately equal. Therefore, we obtain

    E_q[Δ_cent] = ⟨a − b, µ_q − c_x⟩ ≈ 0.

Thus the difference Δ_cent is approximately zero on average. In other words, using the similarity defined in Eq. (3), the similarities between q ∈ D_q and the two data points a, b ∈ D_x become approximately equal on average, indicating that hubness is suppressed. Conversely, if the datasets D_x and D_q are generated from different distributions, c_x and µ_q are not necessarily equal. As a result, the difference Δ_cent is not guaranteed to be zero on average, and hubness is not necessarily suppressed. □

Uniformity Test for Measuring Hubness

In this section, we propose a uniformity test for measuring hubness.
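Before turning to the uniformity test, the algebra above can be checked numerically: Eq. (2) and Eq. (3) should agree up to the factor −1/2 derived in the preceding section, and E_q[Δ_cent] should be near zero when D_x and D_q share a distribution. A minimal sketch (our own code; the sizes, seed, and variable names are arbitrary):

```python
import numpy as np

def centered_sim(Dx, Dq):
    """sim(x, q) = <x - c_q, q - c_x>  (Eq. 3), for all pairs."""
    cx, cq = Dx.mean(axis=0), Dq.mean(axis=0)
    return (Dx - cq) @ (Dq - cx).T

def centered_dissim(Dx, Dq):
    """dissim(x, q) of Eq. (2), computed term by term."""
    cx, cq = Dx.mean(axis=0), Dq.mean(axis=0)
    sq = ((Dx[:, None, :] - Dq[None, :, :]) ** 2).sum(axis=-1)
    return (sq - ((Dx - cx) ** 2).sum(axis=-1)[:, None]
               - ((Dq - cq) ** 2).sum(axis=-1)[None, :]
               + ((cx - cq) ** 2).sum())

rng = np.random.default_rng(0)
mu = np.full(10, 2.0)                     # nonzero mean: raw inner products favor hubs
Dx = rng.normal(loc=mu, size=(500, 10))   # same distribution for both datasets
Dq = rng.normal(loc=mu, size=(500, 10))

S = centered_sim(Dx, Dq)
Dis = centered_dissim(Dx, Dq)             # should equal -2 * S

# Delta_cent = <a - b, q - c_x> averaged over q: close to 0 (Proposition 1).
a, b = Dx[0], Dx[1]
delta_cent = ((Dq - Dx.mean(axis=0)) @ (a - b)).mean()
```

The average `delta_cent` equals ⟨a − b, c_q − c_x⟩ exactly, which shrinks as the sample means converge to the common distribution mean.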
In previous studies (Radovanović, Nanopoulos, and Ivanović 2010a), the occurrence of hubness was measured by the skewness of the distribution of N_k(x), the frequency with which a data point falls into the k-nearest neighbors of other data points; smaller skewness indicates that hubness is suppressed. However, skewness, as a third-order moment, tends to be unstable.

[Figure 1: Rayleigh score obtained by repeating the generation of D_x and D_q 10,000 times for each setting: (a) settings with different µ; (b) settings with different σ. The gray shading represents the range of the mean ± standard deviation.]

In the previous section, we showed that using the similarity measure defined in Eq. (3) can reduce hubness. Here, we further mitigate hubness by normalizing the vector lengths to 1. It is known that when using the inner product similarity, the hubness that arises within a single dataset places hub data at the boundary of the dataset rather than at the center (Low et al. 2013; Suzuki and Hara 2017). Here, the boundary refers to the edge of the region supported by the data distribution. For example, in datasets generated from a Gaussian or t-distribution, data that is notably distant from the center of the dataset, essentially outliers, becomes hub data. In contrast, hubness does not occur when a dataset has neither a boundary nor a density gradient, which is the case when the dataset is distributed on a unit hyper-sphere. Therefore, we propose to normalize the lengths of the difference vectors x − c_q and q − c_x in Eq. (3) to 1, as shown below:

Definition 4 (Normalizing the centered inner product similarity for hubness suppression between two datasets). Given x ∈ D_x and q ∈ D_q, let c_x = (1/|D_x|) Σ_{x∈D_x} x and c_q = (1/|D_q|) Σ_{q∈D_q} q be the centroids of the datasets D_x and D_q. We define the similarity between data points x and q as

    sim(x, q) = ⟨(x − c_q)/‖x − c_q‖, (q − c_x)/‖q − c_x‖⟩.    (4)

Using Eq.
(4), if D_x and D_q are datasets generated from the same distribution, hubness is suppressed. For instance, if D_x and D_q follow the same isotropic Gaussian distribution, (x − c_q)/‖x − c_q‖ becomes uniformly distributed on a unit hyper-sphere, which has no density gradient; similarly, (q − c_x)/‖q − c_x‖ also becomes uniformly distributed on a unit hyper-sphere. Therefore, in this case, hubness is suppressed. In contrast, if D_x and D_q are generated from different distributions, it is not assured that (x − c_q)/‖x − c_q‖ and (q − c_x)/‖q − c_x‖ become uniformly distributed on a unit hyper-sphere, and thus hubness is not suppressed.

The above relation between hubness suppression and uniformity can be observed through the following experiment. Consider the case where x ∈ D_x (|D_x| = 50) is generated from a 10-dimensional Gaussian distribution N(0, I), and q ∈ D_q (|D_q| = 50) is generated from N(µ1, I). Here, I is the 10 × 10 identity matrix, and 1 is the 10-dimensional all-ones vector. We investigate whether (x − c_q)/‖x − c_q‖ is uniformly distributed on a unit hyper-sphere using the score calculated by Rayleigh's method, namely, the length of the mean vector of (x − c_q)/‖x − c_q‖. The results are plotted with µ on the horizontal axis and the Rayleigh score on the vertical axis in Figure 1(a). Additionally, we illustrate the results when x ∈ D_x is generated from N(0, I) and q ∈ D_q is generated from N(0, σ²I) in Figure 1(b). The results show that the larger µ or σ is, that is, the greater the divergence between the distributions of D_x and D_q, the larger the Rayleigh score becomes, meaning that hubness is not suppressed.

Note that in this section, we have explained the proposal using the centroids of the datasets, c_x and c_q.
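The experiment behind Figure 1 can be reproduced with a short sketch (our own re-creation; for brevity we run a single draw per setting rather than the 10,000 repetitions, and omit the σ sweep):

```python
import numpy as np

def rayleigh_score(Dx, Dq):
    """Length of the mean of the unit vectors (x - c_q)/||x - c_q||."""
    Z = Dx - Dq.mean(axis=0)                      # difference vectors x - c_q
    Z /= np.linalg.norm(Z, axis=1, keepdims=True) # project onto the unit sphere
    return np.linalg.norm(Z.mean(axis=0))

rng = np.random.default_rng(0)
n, d = 50, 10
base = rng.normal(size=(n, d))                          # D_x ~ N(0, I)
same = rayleigh_score(base, rng.normal(size=(n, d)))          # D_q: mu = 0
shift = rayleigh_score(base, rng.normal(loc=2.0, size=(n, d)))  # D_q: mu = 2
# A small score indicates near-uniformity on the sphere (hubness suppressed);
# the shifted pair yields a much larger score.
```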
However, by replacing them with the local centroids c_x(q) = (1/k) Σ_{x ∈ D_x^knn(q)} x and c_q(x) = (1/k) Σ_{q ∈ D_q^knn(x)} q, where D_x^knn(q) is the set of x ∈ D_x that are k-nearest neighbors of q and D_q^knn(x) is the set of q ∈ D_q that are k-nearest neighbors of x, we can adapt to the more general case where the dataset is generated from a non-unimodal distribution.

Proposed Hub-CPD Algorithm

In this section, we propose a novel change point detection algorithm, which we refer to as Hub-CPD. It consists of three steps, preprocessing, hubness suppression, and a uniformity test, and utilizes a sliding window approach.

Step 1: Preprocessing

Let us denote multivariate time-series data with m variables as y_1, y_2, …, y_t ∈ R^m. Our proposed algorithm, as well as the state-of-the-art non-parametric methods for change point detection (the RuLSIF and KL-CPD methods), converts time-series data into overlapping sub-series using a sliding window of size w. As a result, we obtain a set of mw-dimensional vectors. For the time-series data from time t − (n + w − 2) to the current time t, denoted y_{t−(n+w−2)}, …, y_t, the Hankel matrix Y is constructed as follows:

    Y = [ y_{t−(n−1)}    y_{t−(n−2)}    …  y_t
          y_{t−n}        y_{t−(n−1)}    …  y_{t−1}
          ⋮              ⋮              ⋱  ⋮
          y_{t−(n+w−2)}  y_{t−(n+w−3)}  …  y_{t−(w−1)} ]

We then take the n column vectors of the matrix Y, that is,

    v_{t−(n−1)} = (y_{t−(n−1)}, y_{t−n}, …, y_{t−(n+w−2)}),  …,  v_t = (y_t, y_{t−1}, …, y_{t−(w−1)}),

which are n sub-series represented as mw-dimensional vectors. Given time-series data, this conversion is useful because we are interested in sub-series rather than the whole: to detect a change in time-series data, we detect the difference between two sets of sub-series, each of which corresponds to those before and after the time when the change occurred. More precisely, by converting the time-series data to mw-dimensional vectors using a sliding window, we reduce the problem of detecting changes in time-series data to the problem of finding a difference between two sets of
mw-dimensional vectors, which we will denote as D and D′ hereafter.

Step 2: Hubness Suppression

To investigate whether D = {v_1, …, v_n} and D′ = {v′_1, …, v′_n} are generated from the same distribution, we utilize the hubness suppression technique of Eq. (4). First, for each v_i ∈ D, we calculate the local centroid in D′, denoted v_i^(knn). This local centroid is the mean vector of v_i^(1), v_i^(2), …, v_i^(k) ∈ D′, the k-nearest neighbors of v_i. Next, we compute the residual vector r_i = v_i − v_i^(knn) obtained by subtracting the local centroid from v_i. Furthermore, we normalize the residual vector to length 1, obtaining z_i = r_i/‖r_i‖. The same process is applied to each v′_i ∈ D′, yielding z′_i = r′_i/‖r′_i‖.

We present this module, which we call Residual Normalization, in Algorithm 1. Aside from the parameter k, the module takes two input datasets, each consisting of n vectors, and transforms them through k-nearest neighbor search, residual vector calculation, and vector normalization, producing the corresponding output. The module can be used repeatedly by connecting its output back to its input, allowing for iterative use. By repeating this process, the original datasets D and D′ undergo transformations, as demonstrated in Figure 2. In this example, dataset D consists of n = 50 vectors following a 10-dimensional standard Gaussian distribution, while dataset D′ consists of n = 50 vectors following a modified distribution in which each axis of the 10-dimensional standard Gaussian distribution is shifted by diff. Figure 2 depicts a plot using the first two dimensions of the 10-dimensional vectors.

[Figure 2: An example of transforming data using the Residual Normalization module. The original data, D and D′, are generated from two Gaussian distributions with a mean difference of diff on each axis. (a) diff = 1.0: there is a difference between D and D′. (b) diff = 0.0: there is no difference between D and D′.]
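A brute-force sketch of the Residual Normalization step just described (our own code; the k-NN search is done naively, and the degenerate case of a zero-length residual is not handled):

```python
import numpy as np

def residual_normalization(D, Dp, k):
    """One pass of Residual Normalization: map each vector to its
    unit-length residual against the local centroid of its k nearest
    neighbors in the *other* dataset."""
    def transform(A, B):
        Z = np.empty_like(A)
        for i, v in enumerate(A):
            knn = np.argsort(((B - v) ** 2).sum(axis=1))[:k]  # k-NN in B
            r = v - B[knn].mean(axis=0)                       # residual vector
            Z[i] = r / np.linalg.norm(r)                      # normalize to length 1
        return Z
    return transform(D, Dp), transform(Dp, D)

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 10))
Dp = rng.normal(loc=1.0, size=(50, 10))   # shifted on each axis: a difference exists
Z, Zp = residual_normalization(D, Dp, k=5)
```

The output can be fed back in as input to iterate the module, as described above.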
By applying the module, the residual vectors are normalized to a length of 1. When there is a difference between the two datasets (i.e., diff = 1.0), applying the module once results in Z (blue plot) and Z′ (light blue plot) being separated; applying the module twice separates Z (red plot) and Z′ (magenta plot) further. However, when there is no difference between the two datasets (i.e., diff = 0.0), applying the module does not separate Z and Z′, and both remain evenly distributed on the unit hyper-sphere.

Algorithm 1: Residual Normalization
Input: D = {v_1, …, v_n}, D′ = {v′_1, …, v′_n}, k
Output: Z = {z_1, …, z_n}, Z′ = {z′_1, …, z′_n}
 1: for each v_i ∈ D do
 2:   (i) Find the k-nearest neighbors v_i^(1), …, v_i^(k) ∈ D′
 3:   (ii) Compute the residual vector r_i = v_i − v_i^(knn),
 4:        where v_i^(knn) = (1/k) Σ_{j=1}^{k} v_i^(j)
 5:   (iii) Normalize the residual vector as z_i = r_i/‖r_i‖
 6: end for each
 7: for each v′_i ∈ D′ do
 8:   (i) Find the k-nearest neighbors v′_i^(1), …, v′_i^(k) ∈ D
 9:   (ii) Compute the residual vector r′_i = v′_i − v′_i^(knn),
10:        where v′_i^(knn) = (1/k) Σ_{j=1}^{k} v′_i^(j)
11:   (iii) Normalize the residual vector as z′_i = r′_i/‖r′_i‖
12: end for each
13: return Z = {z_1, …, z_n}, Z′ = {z′_1, …, z′_n}

Algorithm 2: Proposed Hub-CPD Algorithm
Input: D = {v_1, …, v_n}, D′ = {v′_1, …, v′_n}, k, r
Output: max(R, R′)
1: for each i = 1, …, r do
2:   Z, Z′ = ResidualNormalization(D, D′, k)
3:   D, D′ = Z, Z′
4: end for each
5: Compute the Rayleigh score R = ‖(1/n) Σ_{i=1}^{n} z_i‖
6: Compute the Rayleigh score R′ = ‖(1/n) Σ_{i=1}^{n} z′_i‖
7: return max(R, R′)

As can be seen from this observation, the role of Residual Normalization is to extract the differences between the two datasets. By repeating the process, finer differences can be extracted. However, if the number of repetitions is too high, it may extract differences even when the two datasets are generated from the same distribution. Determining the appropriate number of repetitions is a subject for future work.
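Putting the two listings together, a self-contained sketch of the Hub-CPD score (our own code; the parameter choices k = 5 and r = 1 and all names are illustrative):

```python
import numpy as np

def residual_normalization(D, Dp, k):
    """Algorithm 1: unit-length residuals against local centroids in the
    other dataset (brute-force k-NN search for clarity)."""
    def transform(A, B):
        Z = np.empty_like(A)
        for i, v in enumerate(A):
            knn = np.argsort(((B - v) ** 2).sum(axis=1))[:k]
            r = v - B[knn].mean(axis=0)
            Z[i] = r / np.linalg.norm(r)
        return Z
    return transform(D, Dp), transform(Dp, D)

def hub_cpd_score(D, Dp, k=5, r=1):
    """Algorithm 2: r rounds of Residual Normalization, then the larger
    of the two Rayleigh scores R = ||(1/n) sum_i z_i||."""
    for _ in range(r):
        D, Dp = residual_normalization(D, Dp, k)
    return max(np.linalg.norm(D.mean(axis=0)),
               np.linalg.norm(Dp.mean(axis=0)))

rng = np.random.default_rng(0)
no_change = hub_cpd_score(rng.normal(size=(50, 10)),
                          rng.normal(size=(50, 10)))
change = hub_cpd_score(rng.normal(size=(50, 10)),
                       rng.normal(loc=1.0, size=(50, 10)))
# The score stays small when both window sets share a distribution
# and grows when they differ.
```

In the change detection setting, `D` and `Dp` would be the two sets of sliding-window vectors before and after the candidate time point, and an alarm would be raised when the returned score exceeds a threshold.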
Step 3: Uniformity Test

To measure the uniformity of the vectors on a unit hyper-sphere, the proposed change detection algorithm uses the Rayleigh score (Rayleigh 1919), calculated as the norm of the mean of the normalized residual vectors:

    R = ‖(1/n) Σ_{i=1}^{n} z_i‖,  R′ = ‖(1/n) Σ_{i=1}^{n} z′_i‖.    (5)

If the vectors {z_i}_{i=1}^{n} and {z′_i}_{i=1}^{n} are uniformly distributed on a unit hyper-sphere, Σ_{i=1}^{n} z_i and Σ_{i=1}^{n} z′_i will be close to zero vectors, and hence the values of R and R′ will be close to zero. Using the Rayleigh score, if at least one of R and R′ takes a large value, that is, if max(R, R′) is large, the proposed method determines that there is a difference between D and D′. Note that the proposed method, shown in Algorithm 2, can have different variants depending on the number of iterations, denoted r, of the Residual Normalization module.

Experiment on Artificial Data

Using artificially generated time-series data, we compare the proposed Hub-CPD with four state-of-the-art methods, namely RuLSIF (Liu et al. 2013), KL-CPD (Chang et al. 2019), BnB (Hooi and Faloutsos 2019), and mSSA-MW (Alanqary, Alomar, and Shah 2021). We provide details on the parameter settings of these comparison methods in Appendix A2.

Generation of Artificial Time-Series Data

We use the following two models to generate artificial datasets, both characterized by a mean µ and variance σ². By adjusting the Shift Value, we control the magnitude of fluctuations in the generated data.

(1) Jumping Mean: We fixed the variance σ² = 1 and changed the mean µ alternately at each t_change, i.e., µ = 0 → Shift Value → 0 → Shift Value → ⋯.

(2) Scaling Variance: We fixed the mean µ = 0 and changed the variance σ² alternately at each t_change, i.e., σ² = 1 → Shift Value + 1 → 1 → Shift Value + 1 → ⋯.

The values of the model parameters were changed at 200-time-step intervals, that is, at t_change = 201, 401, 601, …, 19801.
Given the parameter values, 20,000 data points {x_t}_{t=1}^{20000} were generated by the following probabilistic models: (i) an i.i.d. Gaussian N(µ, σ²); (ii) an i.i.d. t-distribution with location parameter µ, scale parameter σ, and 3 degrees of freedom; and (iii) an AR model x_t = 0.9 x_{t−1} + ε_t, where ε_t ∼ N(µ, σ²). In Appendix A1 we present figures plotting the six types of time-series data generated by the three data generation models (Gaussian, t-distribution, and AR model) with the two change patterns (Jumping Mean and Scaling Variance).

Evaluation Metric

All the compared methods output a score at each time point t, which we denote as Score(t). These methods are expected to output higher scores in response to the changes occurring at times t_change(i) = 200i + 1 for i = 1, …, 99. To capture the response to the changes, we recorded the maximum score within 100 time points after each t_change(i), denoted A(i) = max_{t_change(i) ≤ t