# Feature Shift Localization Network

Míriam Barrabés*¹², Daniel Mas Montserrat*¹, Kapal Dev², Alexander G. Ioannidis¹³

Abstract

Feature shifts between data sources are present in many applications involving healthcare, biomedical, socioeconomic, financial, survey, and multisensor data, among others, where unharmonized heterogeneous data sources, noisy data measurements, or inconsistent processing and standardization pipelines can lead to erroneous features. Localizing shifted features is important to address the underlying cause of the shift and correct or filter the data to avoid degrading downstream analysis. While many techniques can detect distribution shifts, localizing the features originating them is still challenging, with current solutions being either inaccurate or not scalable to large and high-dimensional datasets. In this work, we introduce the Feature Shift Localization Network (FSL-Net), a neural network that can localize feature shifts in large and high-dimensional datasets in a fast and accurate manner. The network, trained with a large number of datasets, learns to extract the statistical properties of the datasets and can localize feature shifts from previously unseen datasets and shifts without the need for re-training. The code and ready-to-use trained model are available at https://github.com/AI-sandbox/FSL-Net.

1. Introduction

Feature distribution shifts between data sources are common in many real-world applications using multi-dimensional data composed of a set of corrupted features (i.e., dimensions) with mismatching statistical qualities between sources.
These feature shifts are prevalent in healthcare, biomedical, and life sciences datasets, where different samples are generated at different organizations (e.g., hospitals, labs), with differing lab technologies, hardware, and data processing producing unique structural biases. In clinical genomics, feature shifts can arise from heterogeneous data acquisition protocols, which may involve differences in genotyping arrays or phenotype curation procedures (Moreno Grau et al., 2024). Similar shifts are found in social sciences, streaming, and e-business applications, where combining tabular and structured data from multiple sources, regions, and times without proper homogenization steps can lead to mismatching and biased features due to incorrect data collection procedures, human entry errors, faulty standardization, or erroneous data processing (Barchard & Pace, 2011; Dai et al., 2015). Similarly, multi-sensor applications in the manufacturing industry, medicinal monitoring, finance analysis, and defense can suffer feature shifts due to faulty sensors and measuring devices (Qian et al., 2022; Barrabés et al., 2023). While numerous techniques enable pre-processing data sources to reduce feature shifts, proper data homogenization can be a challenging task requiring data-dependent and domain-specific techniques (Lim et al., 2018).

*Equal contribution. ¹Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA. ²Department of Computer Science, Munster Technological University, Cork T12 P928, Ireland. ³Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95060 USA. Correspondence to: Alexander G. Ioannidis. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).
When data homogenization fails, unattended feature shifts can negatively impact downstream applications, leading to erroneous scientific results or biased machine learning models, which makes feature shift localization critical in many data-driven domains. Feature shift localization is the task of enumerating which features of multi-dimensional datasets are originating the distribution shift between two or more data sources. The localization step is necessary to identify and correct the error source, whether by data removal or correction in tabular data-based applications or through physical intervention in multi-sensor scenarios. Extensive literature on anomaly detection and distribution shift detection (Yu et al., 2018; Pan et al., 2020) has led to numerous tools for automated shift detection that are common in data-centric AI (DCAI) and machine learning systems (MLSys) technologies, providing functionalities for data quality control, homogenization, and monitoring (Ginart et al., 2022; Piano et al., 2022; Zha et al., 2023; Subasri et al., 2023). While many methods focus on asserting whether two datasets follow the same distribution, most do not localize the exact features causing the shift, and recent promising shift localization techniques still fail to scale to the large and high-dimensional datasets common in many areas (Kulinski et al., 2020; Barrabés et al., 2023).

In this work, we introduce a novel neural network, the Feature Shift Localization Network (FSL-Net), which can localize shifts with high accuracy while scaling to high-dimensional and large datasets. The network extracts statistical descriptors from two datasets and then processes them to localize the features originating the distribution shift.
Namely, FSL-Net has two subnetworks: a Statistical Descriptor Network that compresses datasets into statistical functionals that summarize their underlying distribution, and a Prediction Network that combines the statistical descriptors between datasets to predict the probability of being corrupted for each feature. FSL-Net makes use of convolutional and pooling layers to achieve invariance to sample order and approximate equivariance to feature order. The network is trained end-to-end using multiple datasets with different types of simulated feature shifts and is evaluated on previously unseen datasets and shift types, showing that it generalizes well out-of-the-box to a large variety of data and shifts without the need for re-training. FSL-Net surpasses previous feature shift localization methods in localization accuracy and speed. Our contributions include: (1) we propose a novel neural network architecture that provides invariance to sample order and equivariance to feature order while scaling to large and high-dimensional datasets; (2) we design a training approach leading to a network that generalizes to unseen data and shifts without the need for re-training; (3) we provide an in-depth experimental evaluation with multiple manipulation types, datasets, and network configurations.

2. Related Work

Distribution Shift Detection. The detection of distribution shifts consists of predicting whether p = q, where p and q are the reference and query distributions, respectively. Numerous techniques exist for detecting distribution shifts in univariate distributions (Gama et al., 2014; Lu et al., 2018; Pan et al., 2020), and there is a growing focus on multivariate data (Rabanser et al., 2019), which can exhibit various types of shifts such as marginal, concept, covariate, or label shifts (Lu et al., 2016; Losing et al., 2016; Liu et al., 2020).
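As a concrete illustration of the univariate approach, and not any of the cited implementations, marginal shifts can be flagged by running a two-sample Kolmogorov-Smirnov test per feature with a multiple-testing correction. The sketch below uses the asymptotic KS critical value; it only sees marginal shifts, which is precisely the limitation that motivates multivariate and feature-localization methods.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    values = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, values, side="right") / len(a)
    cdf_b = np.searchsorted(b, values, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

def ks_localize(X, Y, alpha=0.05):
    """Flag features whose marginal distributions differ between X and Y.

    Uses the asymptotic KS critical value with a Bonferroni correction over
    the d features. Joint (correlation-only) shifts are invisible to this test.
    """
    (n, d), m = X.shape, len(Y)
    a = alpha / d  # Bonferroni correction over d tests
    critical = np.sqrt(-0.5 * np.log(a / 2)) * np.sqrt((n + m) / (n * m))
    return [i for i in range(d) if ks_statistic(X[:, i], Y[:, i]) > critical]

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))  # reference
Y = rng.normal(size=(2000, 5))  # query: same distribution...
Y[:, 2] += 0.5                  # ...except feature 2 is mean-shifted
assert 2 in ks_localize(X, Y)
```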
Recent shift detection techniques include (Yu et al., 2018), which makes use of hypothesis testing for concept drift detection, and (Rabanser et al., 2019), which applies two-sample multivariate hypothesis testing via Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), univariate hypothesis tests with marginal Kolmogorov-Smirnov (KS) tests, and dimensionality reduction techniques.

Feature Shift Localization. While distribution shift detection techniques focus on detecting whether a shift exists between distributions, feature shift localization methods aim to predict which features are causing the shift. A notable contribution in this area is the work of (Kulinski et al., 2020), which introduces a conditional test capable of accurately identifying shifted features with model-free and model-based approaches: K-Nearest Neighbors with the KS statistic (KNN-KS), multivariate Gaussian with KS (MB-KS), multivariate Gaussian with Fisher-divergence test statistics (MB-SM), and deep density neural models with a Fisher-divergence test (Deep-SM). DataFix (Barrabés et al., 2023) is a more recent method that improves localization accuracy by iteratively training a random forest to distinguish between reference and query distributions, removing the features with the highest impurity-based importance scores until divergence is minimized, and applying a knee-detection algorithm to determine the optimal stopping point. Although DataFix performs well in many cases, it struggles with detecting challenging feature shifts and scales poorly with high-dimensional and large datasets. Its repeated use of random forest training results in significant computational overhead, limiting its applicability to real-world applications involving massive datasets. In this paper, we adopt the same evaluation benchmark as DataFix and introduce FSL-Net to overcome these limitations.

Feature Selection.
Feature selection methods localize the most relevant features for classification or regression, providing interpretability and removing redundancy. Wrapper (Maldonado & Weber, 2009; Mustaqeem et al., 2017), filtering (Nasir et al., 2020; Hopf & Reifenrath, 2021), and embedded methods (Tran et al., 2016; Huang et al., 2018) are among the most common techniques. Wrapper methods select features of interest by training ML models and adding or removing features through a search process. Filtering methods include Mutual Information (MI) (Battiti, 1994), the ANOVA-F test (Elssied et al., 2014), the Chi-square test (Bahassine et al., 2020), Minimum Redundancy Maximum Relevance (MRMR) (Ding & Peng, 2005; Li et al., 2018), and Fast-Conditional Mutual Information Maximization (FAST-CMIM) (Fleuret, 2004), among others. Such methods extract statistical information from the data to rank the importance of each feature. Embedded methods rank features using built-in scores from ML models, such as logistic regression weights (Cheng et al., 2006) or the mean decrease in impurity (Gini index) in random forests (Sylvester et al., 2018), selecting those with the highest scores.

Optimal Transport. Optimal transport (OT) theory compares probability distributions by computing the minimal cost required to transform one into another, inducing a meaningful distance that reflects both global structure and the geometry of the underlying space. Its formulation as a linear programming problem (Kantorovitch, 1958) connected OT to the broader field of optimization (Quanrud, 2018). Its relevance has since expanded across fields, including computer vision (Izquierdo & Civera, 2024), economics (Galichon, 2018), logistics (Nadal-Roig & Plà-Aragonès, 2015), and statistical inference (Goldfeld et al., 2024).
Recent advances in scalable numerical solvers, such as entropic regularization and Sinkhorn iterations, have enabled OT to scale to high-dimensional settings and find applications in data science (Peyré et al., 2019; Montesuma et al., 2024), generative modeling (Sanjabi et al., 2018), and domain adaptation (Courty et al., 2016). However, while OT techniques can characterize divergences, they do not provide a direct methodology to localize divergent features, which is the main focus of this paper, and would require modifications in order to be applied to the feature shift localization task.

Data-centric AI. Data-centric AI (DCAI) is the paradigm that encapsulates tools and techniques aimed at improving data quality and quantity to build robust, accurate, and efficient AI systems. Unlike model-centric approaches, which prioritize refining models while working with a fixed dataset, DCAI emphasizes improving datasets through systematic and iterative processes. Key aspects include expanding datasets through data collection (Ghosh & Kaviraj, 2023), annotation (Boecking et al., 2020; Caporali et al., 2023), augmentation (Montserrat et al., 2017; Geleta et al., 2023), and integration (Xiaojuan & Yu, 2023), as well as refining data through cleaning (Krishnan & Wu, 2019; Costanzo, 2023; Barrabés et al., 2024) and feature engineering (Sinaci et al., 2023; Buckley et al., 2023). The increasing complexity and scale of datasets have made automated methods indispensable for data refinement. A growing focus within DCAI is the localization of feature shifts (Zha et al., 2023; Barrabés et al., 2023).

Deep Sets, Equivariant, and Invariant Networks. Invariance and equivariance properties are important in many applications and have proved successful in modeling physical and chemical systems, with numerous neural networks developed to have such properties (Benton et al., 2020; Batzner et al., 2022; Ruhe et al., 2024).
Graph neural networks and attention-based networks have been shown to provide similar properties (Lim & Nelson, 2022). Deep Sets (Zaheer et al., 2017) introduced an architecture for modeling set-structured data, ensuring invariance to the order of samples. Our proposed FSL-Net adopts similar design choices as Deep Sets to obtain sample-order invariance and approximate feature-order equivariance. Section A provides a more detailed description of the benchmarking methods evaluated in this paper, including DataFix, MB-SM, MB-KS, KNN-KS, and Deep-SM, as well as MI, SelectKBest, MRMR, and FAST-CMIM.

3. Feature Shift Localization Network

Problem Formulation. We follow a similar problem formulation as described in (Kulinski et al., 2020; Barrabés et al., 2023), with minor adaptations:

Definition 1 (Empirical Feature Shift Localization Task). We are given two sets of $d$-dimensional samples $X = \{x_1, x_2, \ldots, x_N\}$ and $Y = \{y_1, y_2, \ldots, y_M\}$ from distributions $p$ and $q$ respectively, with $x_i, y_i \in \mathbb{R}^d$, $|X| = N$, and $|Y| = M$. The feature shift localization task consists of predicting the subset of shifted features $C$ from the input data $X$ and $Y$ using a mapping $F$, such that $C = F(X, Y)$, satisfying $D(p_{\bar{C}}, q_{\bar{C}}) = 0$, $D(p, q) > 0$, and $C = \arg\min_{C' : D(p_{\bar{C}'}, q_{\bar{C}'}) = 0} |C'|$, where $D$ is a valid statistical distance.

A feature shift between $d$-dimensional distributions $p$ and $q$ is present if, after removing the corrupted dimensions $C$ and keeping only the non-corrupted dimensions $S = \bar{C}$, with $d = |S| + |C|$, the divergence between the restricted distributions satisfies $D(p_S, q_S) = 0$. The number of corrupted features $|C|$ is assumed to be unknown. We refer to $X$ and $Y$ as the reference and query datasets, and to $p$ and $q$ as the reference and query distributions, respectively. In practice, $p$ and $q$ are unknown and only accessible through the samples $X$ and $Y$, requiring the task to be approximated by a method $F$ that maps the input data to the predicted set of corrupted features: $\hat{C} = F(X, Y)$.
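Note that Definition 1 deliberately covers shifts that no per-feature test can see. As a toy illustration (hypothetical, not from the paper): independently permuting one column of the query preserves every marginal exactly, so each $D(p_i, q_i) = 0$, yet $D(p, q) > 0$ because that column's correlation with the rest is destroyed, making $C$ the permuted column.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=(5000, 1))
X = np.hstack([z, z + 0.1 * rng.normal(size=(5000, 1))])  # two highly correlated features
Y = X.copy()
Y[:, 1] = rng.permutation(Y[:, 1])  # corrupt feature 1: C = {1}

# Every marginal is unchanged (same values, merely reordered)...
assert np.allclose(np.sort(X[:, 1]), np.sort(Y[:, 1]))
# ...but the joint distribution shifted: the correlation collapses.
corr_ref = np.corrcoef(X, rowvar=False)[0, 1]
corr_query = np.corrcoef(Y, rowvar=False)[0, 1]
assert corr_ref > 0.9 and abs(corr_query) < 0.1
```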
Such a method $F$ can take the form of machine learning-based hypothesis testing (Kulinski et al., 2020), iterative heuristic algorithms as in DataFix (Barrabés et al., 2023), or the end-to-end trained parametric neural network introduced in this paper. Typically, $F$ is designed to predict a set $\hat{C}$ that is as close as possible to the true set of corrupted features $C$. This set can be represented either as a collection of positional indices or as a $d$-dimensional Boolean vector $C \in \{0, 1\}^d$, where each entry indicates whether the corresponding feature is corrupted or not. Note that the vector representation is equivariant to the feature ordering. In this paper, we use the set and vector notations interchangeably unless unclear from the context. We focus on scenarios where either no features or only a subset exhibit a shift. If all features experienced a shift, it would be impossible to determine whether the differences arise naturally between the reference and the query. Thus, we base our approach on the assumption that the true, unmodified query originates from the same distribution as the reference.

As described in (Barrabés et al., 2023), the presented definition of feature shifts covers a wide range of distribution shifts: marginal shifts with $D(p_i, q_i) > 0$, where $p_i$ and $q_i$ represent the marginal distribution of the $i$th dimension; correlation shifts with $D(p, q) > 0$ and $D(p_i, q_i) = 0$ for all $i$, where marginal distributions match but the multivariate distributions do not; and similarly, correlation shifts with $D(p_S, q_S) = 0$ and $D(p_C, q_C) = 0$ but $D(p, q) > 0$, where correlations are maintained locally, but a shift is present when considering $C$ and $S$ simultaneously. Note that this framework can also model label shifts in regression or classification tasks by simply considering the label as an additional dimension of $p$ and $q$.

Figure 1. Diagram illustrating the FSL-Net architecture.

Feature Shift Localization Network Overview.
The proposed Feature Shift Localization Network (FSL-Net) is a model trained end-to-end to predict the set of corrupted features: $\hat{C} = F_\theta(X, Y)$. FSL-Net infers the probability of each feature being part of $C$; that is, the network takes as input the reference and query datasets and predicts a $d$-dimensional vector of probabilities, $\hat{P} = \psi_\theta(X, Y)$, such that the $i$th dimension indicates the probability of the $i$th feature being corrupted: $\hat{P}(i \in C) = \hat{P}_i = \psi_\theta(X, Y)_i$. The complete predicted set of corrupted features $\hat{C}$ can then be obtained by selecting all the features with a probability higher than 0.5: $\hat{C} = \{i : \hat{P}(i \in C) > 0.5\}$.

The Feature Shift Localization Network has two main subnetworks: the Statistical Descriptor Network $\phi_\theta$, which generates a finite-dimensional vector summarizing the input distribution, $\mu_p = \phi_\theta(p)$ (and equivalently for $q$), and the Prediction Network $\gamma_\theta$, which takes both vectors and predicts the corruption probability for each feature: $\hat{P} = \gamma_\theta(\mu_p, \mu_q)$. The network is designed to generalize well across datasets of different feature dimensionalities and sample sizes, and by using convolutions, it can scale to high-dimensional and large datasets without requiring re-training.

Statistical Functionals and Statistical Functional Maps. A statistical functional is a mapping $T$ that takes as input a cumulative distribution function (CDF) $P$ (or similarly a pdf) of a distribution $p$ and outputs a scalar or vector $\mu = T(P) = T(p)$. Some examples of statistical functionals include the mean, variance, mode, and histograms of the distribution. Because $p$ and $q$ are unknown and only accessible through $X$ and $Y$, we extract the statistical functionals from the empirical distributions $p_N$ and $q_M$, which are constructed by assigning equal probability mass to each of the $N$ and $M$ samples of $X$ and $Y$, respectively.
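Concretely, evaluating a statistical functional on the empirical distribution reduces to a computation over the samples. A small illustrative sketch (not from the paper), which also shows that such functionals depend only on the distribution of values, not on sample order:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))  # N=1000 samples from an (unknown) p

# Statistical functionals T(p_N) evaluated on the empirical distribution,
# which places probability mass 1/N on each of the N samples:
mean = X.mean(axis=0)  # mean functional, one scalar per dimension
var = X.var(axis=0)    # variance functional

# Functionals of the empirical distribution are invariant to sample order:
perm = rng.permutation(len(X))
assert np.allclose(X[perm].mean(axis=0), mean)
assert np.allclose(X[perm].var(axis=0), var)
```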
Statistical functionals of interest include linear functionals of the form $\mu = A(p) = \int g(x)\, dP(x)$, which can be expressed as a simple average for empirical distributions: $\mu_N = A(p_N) = \frac{1}{N} \sum_{j=1}^{N} g(x_j)$. Note that $g(\cdot)$ does not need to be a linear mapping and can be any (potentially non-linear) function, even a neural network. Empirical mean and histogram estimates are examples of linear functionals.

In this paper, we extend the concept of statistical functionals to statistical functional maps: mappings $T$ that project a $d$-dimensional multivariate CDF $P$ (or equivalently a pdf $p$) into a $d \times t$ tensor $\mu \in \mathbb{R}^{d \times t}$, where the $(i, k)$th component $\mu_{i,k} = T_k(P, i)$ is obtained by applying the mapping $T_k$ to the multivariate distribution $P$ while using the positional information $i$ (i.e., the dimension index). An example of a statistical functional map is a tensor representing a histogram of $t$ bins for each of the $d$ marginal distributions, which can be computed as $\mu^H_{i,k} = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}_{b_k}(x_{j,i})$, with $1 \le k \le t$ and $1 \le i \le d$, where $x_{j,i}$ is the $i$th dimension of the $j$th sample, and $\mathbb{1}_{b_k}(x)$ equals 1 if $x$ is in the interval $b_k$ defining the $k$th bin, and 0 otherwise. Other examples include tensors capturing the first $t$ moments of the $d$ marginal distributions or the $d \times d$ covariance matrix $\mu^C_{i,k} = \mathrm{Cov}_p(i, k)$ of distribution $p$, with $t = d$. Statistical functional maps, indexed by dimension $i$, provide finite-dimensional summaries of multivariate distributions, where each $i$th component captures the statistical properties of the $i$th dimension of the distribution and its interactions with other dimensions.

Statistical Descriptor Network. The Statistical Descriptor Network is the first component of FSL-Net. This network extracts multiple statistical functional maps from the reference and query datasets, which are then fed into the Prediction Network to localize potential shifts.
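For instance, the marginal-histogram functional map $\mu^H$ described above can be computed directly from the samples. A minimal sketch, assuming features have been pre-normalized to $[0, 1]$ as in FSL-Net's preprocessing:

```python
import numpy as np

def histogram_map(X, t=8):
    """Compute a d x t statistical functional map of marginal histograms.

    Entry (i, k) is the fraction of samples whose ith feature falls into the
    kth of t equal-width bins over [0, 1] (features assumed pre-normalized).
    """
    N, d = X.shape
    edges = np.linspace(0.0, 1.0, t + 1)
    mu = np.empty((d, t))
    for i in range(d):
        counts, _ = np.histogram(X[:, i], bins=edges)
        mu[i] = counts / N  # each row is a marginal pmf over the t bins
    return mu

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 4))
mu_H = histogram_map(X, t=8)
assert mu_H.shape == (4, 8)
assert np.allclose(mu_H.sum(axis=1), 1.0)
```

Note that the map's shape depends on the number of features $d$ but not on the number of samples $N$, which is what lets the downstream network handle datasets of arbitrary size.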
The Statistical Descriptor Network extracts three statistical functional maps: (1) a non-parametric map $\mu_S = \phi_S(p_N)$, with $\mu_S \in \mathbb{R}^{d \times t_1}$, consisting of simple statistical measures such as marginal means and histograms; (2) a map predicted by a parametric shallow network named the Moment Extraction Network, $\mu_\tau = \phi_\tau(p_N)$, with $\mu_\tau \in \mathbb{R}^{d \times t_2}$, designed to extract second and higher moments of the data; and (3) a final map extracted by a parametric deep residual network named the Neural Embedding Network, $\mu_\omega = \phi_\omega(p_N)$, with $\mu_\omega \in \mathbb{R}^{d \times t_3}$, designed to extract a richer representation of the dataset. The maps are concatenated, generating a unique map $\mu_p \in \mathbb{R}^{d \times t}$ describing distribution $p$ (and similarly for $q$), with $t = t_1 + t_2 + t_3$. Note that $d$ varies with each input dataset, but $t$ remains fixed across datasets and only depends on the network hyperparameters. We use $\mu_p = \phi_\theta(p_N) = [\mu_S; \mu_\tau; \mu_\omega]$ to denote the complete mapping of the three statistical functional maps for distribution $p_N$.

Statistical Measures. The first statistical functional map $\mu_S = \phi_S(p_N)$ contains various measures that capture key statistical properties of individual features in the reference and query datasets. Namely, these measures include the mean (indicating the average value), standard deviation (indicating the spread), median (indicating the midpoint), mean absolute deviation (indicating the average distance from the mean), $p$-order moments (capturing univariate higher-order characteristics), marginal histograms (approximating marginal pdfs), and empirical marginal CDFs. Table 1 presents the formulas for each measure. Note that except for histograms, empirical CDFs, and $p$-order moments, all measures can be represented with arrays of dimensionality $d \times 1$, while the histograms and empirical CDFs are represented in tensors of size $d \times t_h$, where $t_h$ is the number of bins or powers $p$ used. By concatenating all the measures, we obtain a statistical functional map $\mu_S \in \mathbb{R}^{d \times t_1}$.

Table 1. Statistical measures of the Statistical Descriptor Network. $N$ is the number of observations in $X$, $x_j$ denotes the $i$th dimension of the $j$th sample (the subscript $i$ is omitted for brevity), $\epsilon$ is a small positive constant for numerical stability, $b_k$ and $c_k$ are histogram intervals and CDF thresholds, respectively, and $\bar{x}$ and $\sigma$ are the empirical mean and standard deviation of the $i$th dimension.

| Statistical Measure | Linear | Equation $\mu_{i,k}$ |
|---|---|---|
| Mean | Yes | $\bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j$ |
| Standard Deviation | No | $\sigma = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (x_j - \bar{x})^2}$ |
| Median | No | $x_{(N/2)}$ |
| Mean Absolute Deviation | Yes | $\frac{1}{N} \sum_{j=1}^{N} \lvert x_j - \bar{x} \rvert$ |
| $p$-order Moments | Yes | $\frac{1}{N} \sum_{j=1}^{N} x_j^p$ |
| Histogram | Yes | $\frac{1}{N} \sum_{j=1}^{N} \mathbb{1}(x_j \in b_k)$ |
| Empirical CDF | Yes | $\frac{1}{N} \sum_{j=1}^{N} \mathbb{1}(x_j \le c_k)$ |

Moment Extraction Network. The previously described statistical measures work well for capturing marginal distributions, but they fail to capture correlations and higher-order relations between dimensions. While the covariance matrix could address this, it becomes impractical for high-dimensional datasets due to its quadratic growth with the number of features. In order to capture higher-order relations between features, we make use of the Moment Extraction Network $\mu_\tau = \phi_\tau(p_N) = \frac{1}{N} \sum_{j=1}^{N} (W x_j + b)_+$, where $(W, b)$ represents an affine mapping and $(\cdot)_+$ is the ReLU activation function. In order to adapt to changing input dimensionalities $d$, the affine mapping is parametrized using a convolutional layer and a batch normalization layer. By applying padding, the output of the network has dimensionality $d \times t_2$, where $t_2$ is the number of output channels of the convolution. The convolutional and ReLU layers are applied to each sample $x_j$ independently, and a sample-wise mean pooling operation is applied to obtain a dataset-level vector. Note that this network can be seen as approximating generalized moments of the data (Gretton et al., 2012; Li et al., 2015; Perera et al., 2022). Both the Moment Extraction Network and the Neural Embedding Network are trained jointly with the Prediction Network using a cross-entropy loss and an auxiliary loss (see the Loss Functions section).
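The Moment Extraction Network's computation can be sketched as follows. This is a simplified stand-in rather than the trained model: random untrained weights, no batch normalization, and a plain NumPy 1D convolution along the feature axis in place of the convolutional layer.

```python
import numpy as np

def moment_extraction(X, W, b):
    """Sketch of the Moment Extraction Network: mu = mean_j ReLU(conv(x_j)).

    W has shape (t2, k): t2 output channels and an odd kernel width k applied
    along the feature axis of each sample independently with 'same' padding;
    a sample-wise mean pool then yields a single d x t2 map for the dataset.
    """
    N, d = X.shape
    t2, k = W.shape
    pad = k // 2
    Xp = np.pad(X, ((0, 0), (pad, pad)))  # zero 'same' padding on features
    out = np.zeros((N, d, t2))
    for c in range(t2):
        for offset in range(k):  # sliding window along the feature axis
            out[:, :, c] += W[c, offset] * Xp[:, offset:offset + d]
        out[:, :, c] += b[c]
    out = np.maximum(out, 0.0)   # ReLU
    return out.mean(axis=0)      # sample-wise mean pool -> (d, t2)

rng = np.random.default_rng(0)
X = rng.uniform(size=(256, 16))                      # N=256 samples, d=16 features
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)   # t2=4 channels, kernel width 3
mu_tau = moment_extraction(X, W, b)
assert mu_tau.shape == (16, 4)
# Mean pooling makes the map invariant to sample order:
assert np.allclose(moment_extraction(X[::-1].copy(), W, b), mu_tau)
```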
Neural Embedding Network. To complement the statistical measures and the Moment Extraction Network, we include a convolutional deep residual network to predict a linear statistical functional map: $\mu_\omega = \phi_\omega(p_N) = \frac{1}{N} \sum_{j=1}^{N} \omega_\theta(x_j)$. The Neural Embedding Network, as well as the Prediction Network detailed below, is built upon residual blocks. Each residual block consists of alternating 1D-convolutional layers across features, batch normalization (BN), and a hyperbolic tangent activation function (Tanh). Skip connections are incorporated to facilitate efficient information flow and mitigate vanishing gradients. We conducted experiments with attention-based Multi-Layer Perceptrons (MLPs) using unit-kernel convolutions, but a fully convolutional design worked best (see Section D). The $i$th output of the statistical functional map predicted by the Neural Embedding Network has the form $\mu_{\omega,i} = \frac{1}{N} \sum_{j=1}^{N} \left( W x_{j,i} + \sum_{h=1}^{H} \omega_h(x_j) \right)$, where $W x_{j,i}$ is an affine transformation of the $i$th input feature, and $\omega_h(x_j)$ denotes the output of each of the $H$ residual blocks, capturing non-linear relationships between features. The complete Statistical Descriptor Network is applied to $X$ and $Y$, yielding $\mu_p = \phi_\theta(p_N) = \phi_\theta(X)$ and $\mu_q = \phi_\theta(q_M) = \phi_\theta(Y)$. Note that $\phi_\theta(\cdot)$ is shared between the reference and the query. These descriptors are then fed into the Prediction Network.

Prediction Network. The Prediction Network $\gamma_\theta$ combines the statistical functional maps $\mu_p = \phi_\theta(p_N)$ and $\mu_q = \phi_\theta(q_M)$ to predict a vector of probabilities $\hat{P} = \gamma_\theta(\mu_p, \mu_q)$, indicating the likelihood of each feature belonging to the corrupted set $C$. These maps are combined through an operation $\alpha(\cdot)$, producing a joint statistical map $\mu_{p,q} = \alpha(\mu_p, \mu_q)$. After evaluating various merging operations (see Section E.3), we selected the normalized squared difference: $\alpha(\mu_p, \mu_q) = \frac{(\mu_p - \mu_q)^2}{\|\mu_p\| + \epsilon}$, where $\epsilon$ is a small positive constant for numerical stability.
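A minimal sketch of this merging step, under the assumption that the normalization $\|\mu_p\|$ is applied element-wise as $\lvert \mu_p \rvert$ (a per-row or global norm would also be consistent with the formula), using random stand-in maps rather than the network's learned descriptors:

```python
import numpy as np

def merge(mu_p, mu_q, eps=1e-6):
    """Normalized squared difference between two d x t statistical maps.

    Assumption: normalization is element-wise; eps guards against division
    by zero for near-zero reference statistics.
    """
    return (mu_p - mu_q) ** 2 / (np.abs(mu_p) + eps)

rng = np.random.default_rng(0)
mu_p = rng.uniform(0.5, 1.0, size=(16, 8))  # reference map: d=16 features, t=8 stats
mu_q = mu_p.copy()
mu_q[3] += 0.4                              # feature 3's statistics are shifted

joint = merge(mu_p, mu_q)                   # d x t joint map fed to residual blocks
scores = joint.mean(axis=1)                 # crude per-feature divergence score
assert scores.argmax() == 3                 # the shifted feature stands out
```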
This approach enables the network to compare statistical maps in a manner that accounts for their relative magnitudes. The resulting joint representation has dimensions $\mu_{p,q} \in \mathbb{R}^{d \times t}$. The Prediction Network then applies multiple residual blocks to the joint map $\mu_{p,q}$ and produces the final probability estimates through a sigmoid activation layer. Note that, by employing the squared difference, the Prediction Network observes a difference between statistical maps resembling the Maximum Mean Discrepancy (MMD) metric (Gretton et al., 2012; Li et al., 2015) and acts as a mapping from a distance between distributions into shift probability estimates. The complete structure of the feature shift localization network has the following form:

$\hat{P} = \psi_\theta(p_N, q_M) = \psi_\theta(X, Y) = \gamma_\theta(\phi_\theta(p_N), \phi_\theta(q_M))$ (1)

Loss Functions. FSL-Net is trained end-to-end using the binary cross-entropy loss between the predicted probabilities $\hat{P}$ and the ground-truth corrupted feature set $C$: $\ell_{CE}(C, \hat{P}) = -\sum_{k=1}^{d} \left[ C_k \log(\hat{P}_k) + (1 - C_k) \log(1 - \hat{P}_k) \right]$. We also add an auxiliary loss function to the predicted statistical functional maps to encourage the learning of useful discriminative features and enforce locality: $\ell_{aux}(C, \mu_p, \mu_q) = \frac{\|\mu_{p,\bar{C}} - \mu_{q,\bar{C}}\|^2}{\|\mu_{p,C} - \mu_{q,C}\|^2}$, which encourages the descriptors of non-corrupted features to remain close across datasets while amplifying the differences on corrupted features. The loss is computed and averaged across multiple labeled datasets with simulated shifts, denoted as $D = \{(C^{(z)}, X^{(z)}, Y^{(z)}) : 1 \le z \le N_D\}$, resulting in the total loss function: $L(\psi_\theta, D) = \sum_{z=1}^{N_D} \ell_{CE}\left(C^{(z)}, \psi_\theta(X^{(z)}, Y^{(z)})\right) + \lambda \ell_{aux}\left(C^{(z)}, \phi_\theta(X^{(z)}), \phi_\theta(Y^{(z)})\right)$. In practice, $L(\psi_\theta, D)$ is approximated using mini-batches and gradient accumulation and optimized with the Adam optimizer.

Sample-wise Invariance, Feature-wise Equivariance, and Locality.
Ensuring feature equivariance and sample invariance can help neural networks generalize across datasets with varying numbers of features and samples. (a) Sample-wise invariance: The mean pooling operation used in the linear functionals, Neural Embedding, and Moment Extraction Networks provides maps with shapes that are independent of dataset size, allowing FSL-Net to generalize across datasets of different dimensions. Additionally, the non-linear statistical functionals are computed with the samples sorted from smallest to largest, making them invariant to sample order. (b) Feature-wise equivariance: The statistical measures are applied to the marginal distributions, making them equivariant to feature ordering. Convolutions, however, are not typically feature-wise equivariant. Instead, FSL-Net approximates feature-order equivariance by shuffling features in each training mini-batch, enforcing learned representations to be robust to feature order. (c) Locality: We enforce that the $i$th dimension of the statistical functional maps primarily captures the statistical properties of the $i$th input feature by (1) using marginal distribution-based statistical measures, (2) incorporating a residual connection in the Neural Embedding Network, and (3) applying the auxiliary loss to the statistical functional maps.

Training and Validation Datasets. We source a total of 1,032 diverse tabular datasets from OpenML (Van Rijn et al., 2013), with 10-28k features and 500-3.6M samples, covering continuous, categorical, and mixed data types. Additionally, we generate 368 simulated datasets: 184 based on probabilistic distributions (Gaussian, Bernoulli, and Beta mixtures) and 184 from algebraic functions (Polynomial, Sine, and Logarithmic), each with 5,000 samples and 1,000 features. In total, 1,350 datasets are used for training, with 50 reserved for validation. Section B describes in detail the dataset selection, preprocessing, and simulation procedures.

Table 2. Manipulation types applied to continuous and/or categorical features during training and validation.

| Type | Mapping | Description | Shift | Data |
|---|---|---|---|---|
| T1 | $\beta_i x_i$, $\beta_i \sim \mathrm{Uniform}(0, 1)$ | Each value is multiplied by a random number between 0 and 1. | $p_i \neq q_i$ | Cont. |
| T2 | $\beta_i(1 - x) + (1 - \beta_i)x$, $\beta_i \sim \mathrm{Uniform}(0, 1)$ | Each value is replaced by a linear combination of $x$ and its negation. | $p_i \neq q_i$ | Cont. |
| T3 | $x_i \sim p_{N,i}$ | The $i$th feature of $Y$ is replaced by sampling from the reference empirical marginal distribution $p_{N,i}$. | $p_i = q_i$, $p_C \neq q_C$, $q_C = \prod_{i \in C} q_i$ | Both |
| T4 | $\mathrm{clamp}_{0,1}(x + \epsilon)$, $\epsilon \sim \mathrm{Normal}(\mu, \sigma)$ | Add Gaussian noise with $\mu \sim \mathrm{Uni.}(-0.2, 0.2)$ and $\sigma \sim \mathrm{Uni.}(0.001, 0.5)$. | $p_i \neq q_i$, $E[q_i] \approx E[p_i] + \mu$ | Cont. |
| T5 | $\mathrm{CNN}(x)$ | Forward through a CNN with min-max normalization or binarization. | $p_i \neq q_i$ | Both |
| T6 | $x_C \sim p_{N,C}$ | Similar to T3, but all the features within $C$ are sampled simultaneously from $p_{N,C}$. | $p_i = q_i$, $p_C = q_C$, $p \neq q$ | Both |
| T7 | $\mathrm{KNN}(x)$, $K \in \{1\text{-}4, 7\text{-}9\}$ | Predict feature with KNN (Regressor). | - | Cont. |
| T8 | $\mathrm{KNN}(x)$, $K \in \{1\text{-}4, 7\text{-}9\}$ | Predict feature with KNN (Classifier). | - | Cat. |

Training and Validation Manipulation Simulations. During training, datasets are shuffled in sample and feature order, and random subsets of samples (from 1,000 to 10,000) and features (from 8 to 256) are selected, with each feature normalized to a range of 0 to 1. Each subset is then split equally into reference and query samples. In the query set, a random subset of features (up to 25%) is manipulated based on feature type (continuous or categorical). Validation batches follow the same process but are limited to 2,048 features. Manipulations, outlined in Table 2 (T1 to T8), are selected with probabilities inversely proportional to the validation F-1 score observed during training for each given type. Multiple manipulations are used to simulate a wide range of feature shifts. Note that the manipulations applied during training and validation differ from those applied to the test set.

4. Experimental Results

Evaluation Setup.
Our evaluation setup is consistent with (Barrabés et al., 2023), using the same reference and query sets and optimized benchmarking methods. Hyperparameter tuning for FSL-Net is detailed in Section E. We compare FSL-Net against five feature shift localization methods (DataFix, MB-SM, MB-KS, KNN-KS, and Deep-SM) and four feature selection methods (MI, SelectKBest, MRMR, and FAST-CMIM). The SelectKBest method employs the Chi-square test for categorical datasets and the ANOVA F-test for continuous datasets. For MB-SM, MB-KS, KNN-KS, and Deep-SM, the number of manipulated features |C| is provided, while the other methods, including FSL-Net, do not require it, reflecting a more realistic setting.

Evaluation Data. We use the same evaluation manipulations and evaluation datasets as in (Barrabés et al., 2023), with the Gas, Covid, and Energy datasets also aligning with

Table 3. Datasets used during benchmarking.

- Continuous: Gas (8 features, 12,815 samples); Covid (10 features, 9,889 samples); Energy (26 features, 19,735 samples); Musk2 (166 features, 6,598 samples); Scene (294 features, 2,407 samples); MNIST (784 features, 70,000 samples); Polynomial (1,000 features, 10,000 samples); Cosine (1,000 features, 10,000 samples); Dilbert (2,000 features, 10,000 samples).
- Categorical: Phenotypes (1,227 features, 31,424 samples); Founders (10,000 features, 4,144 samples); Canine (198,473 features, 1,444 samples).

those in (Kulinski et al., 2020). The benchmark datasets vary in data type, feature dimensionality (from 8 to 198,473), and sample size (from 1,444 to 70,000). Table 3 provides an overview of each dataset, detailing its data type (continuous or categorical), sample size, and feature dimensionality. The continuous datasets are sourced from the UCI repository (Gas (Huerta et al., 2016), Energy (Candanedo et al., 2017), and Musk2 (Blake, 1998)) and OpenML (Scene (Boutell et al., 2004), MNIST (Deng, 2012), and Dilbert (Vanschoren et al., 2014)). Additionally, a Covid-19 dataset (Force, 2022) and two simulated datasets generated from algebraic functions that differ from those used in our training and validation simulations (Cosine and Polynomial) (Barrabés et al., 2023) are included.
The categorical datasets consist of high-dimensional biomedical data, including the Phenotypes dataset (Qian et al., 2020), a subset of categorical traits from the UK Biobank; the Founders dataset, containing binary-coded human DNA sequences (Perera et al., 2022); and the Canine dataset, comprising binary-coded dog DNA sequences (Barrabés et al., 2023). Each dataset is normalized on a per-feature basis to a range of 0 to 1, and the samples are evenly divided into two subsets, forming the reference and query sets. As in (Barrabés et al., 2023), a random fraction of query features (5%, 10%, or 25%) undergoes one of 10 manipulation types for continuous data or

Figure 2. Performance and runtime comparison across feature shift localization methods: a) mean F-1 scores across manipulation types, fractions of manipulated features, and datasets; b) mean F-1 scores vs. mean runtime; c) mean F-1 scores vs. maximum runtime; d) mean runtime vs. sample-feature size product per dataset; e) mean F-1 scores by manipulation type; f) mean F-1 scores by dataset.

8 for categorical data (referred to as manipulations E1-E10). This process generates 30 query sets for continuous data and 24 for categorical data, each with a unique manipulation applied to a given feature subset. Section C details the evaluation manipulation types. Note that both testing datasets and manipulations differ from those used during FSL-Net training and validation (see Table 2).

Evaluation Protocol and Hardware Specifications.
Performance is evaluated using the F-1 score for feature shift localization accuracy and wall-clock runtime for computational efficiency. Each experiment is run for up to 30 hours, which prevents some methods from completing evaluations on large datasets. To handle incomplete evaluations, we impute missing results with the lowest F-1 score from the same experiment among competing methods and assign them the 30-hour limit as runtime. All evaluations were conducted on an Intel Xeon Gold with 12 CPU cores.

Feature Shift Localization Performance and Runtime Comparison. Figure 2 presents the performance and runtime comparison across different methods. Figure 2a presents the average F-1 scores of feature shift localization computed across manipulation types, fractions of manipulated features, and datasets. Figure 2b depicts the mean F-1 scores against the mean runtime in hours, while Figure 2c illustrates the F-1 scores against the maximum runtime, highlighting the worst-case computational demands. Higher positions in these plots indicate better performance, whereas leftward positions indicate lower runtimes, emphasizing the balance between effectiveness and efficiency. SelectKBest and MI, despite their efficiency, exhibit poor performance in feature shift localization. Methods such as MRMR and FAST-CMIM struggle with scalability and display limited localization capabilities. Notably, FSL-Net achieves the highest average F-1 score, with feature shift localization performance comparable to that of DataFix but at significantly faster speeds. Specifically, FSL-Net is approximately 36× faster on average than DataFix, and up to 136× faster on the high-dimensional Phenotypes dataset. Additionally, FSL-Net significantly outperforms MB-SM, MB-KS, KNN-KS, and Deep-SM, even though these methods have the advantage of accessing the ground-truth |C|, while also demonstrating substantially faster computational speeds.
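As a concrete reading of this protocol, the per-experiment F-1 computation over predicted shifted-feature sets and the imputation rule for timed-out runs can be sketched as follows; the helper names are our own illustration, not the released evaluation code.

```python
def f1_localization(pred, truth):
    """F-1 score between a predicted and a true set of shifted feature indices."""
    pred, truth = set(pred), set(truth)
    tp = len(pred & truth)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def impute_missing(scores):
    """Replace missing (timed-out) results with the lowest completed F-1 score
    from the same experiment; timed-out runs are separately charged the
    30-hour runtime limit."""
    floor = min(s for s in scores.values() if s is not None)
    return {method: (s if s is not None else floor) for method, s in scores.items()}

# A method predicting {1, 2, 5} when {1, 2, 3} shifted gets precision = recall = 2/3.
example = {"method-a": f1_localization({1, 2, 5}, {1, 2, 3}), "method-b": None}
example = impute_missing(example)
```

A timed-out method is thus never rewarded: it inherits the weakest completed score and the maximum runtime.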
While DataFix relies on a computationally intensive iterative optimization heuristic, and MB-SM, MB-KS, KNN-KS, and Deep-SM require dataset-specific training, FSL-Net only requires a forward pass through the network for each dataset, providing very fast inference. Figure 2d shows the mean runtime as a function of the product of the sample and feature sizes for each dataset, confirming that FSL-Net consistently outperforms DataFix in speed across all datasets, ranking as the second-fastest method after SelectKBest, comparable to MI, but with much higher accuracy. Namely, DataFix requires up to 13 hours on the high-dimensional Phenotypes dataset from the UK Biobank and 9 hours on the Canine dataset, whereas FSL-Net completes these tasks in just 2 minutes and 12 minutes, respectively. These results highlight the practical advantage of FSL-Net: it is well-suited for processing high-dimensional, large databases, such as those commonly found in biomedicine and e-commerce, making it a valuable alternative to DataFix due to its efficient scalability and excellent localization accuracy.

Performance across Manipulation Types and Datasets. Figure 2e and Figure 2f show the mean F-1 scores categorized by type of manipulation and dataset, respectively. FSL-Net consistently matches or exceeds the performance of competing methods across all manipulation types, except for E9 and E4, where DataFix exhibits superior performance. These manipulations introduce only minimal perturbations that FSL-Net fails to localize, though lower probability thresholds may improve detection. DataFix fails to accurately detect E10, a shortcoming effectively addressed by FSL-Net. Methods relying on univariate tests, such as MRMR and FAST-CMIM, perform well for manipulations causing marginal distribution shifts but fail entirely with manipulations affecting feature correlations (manipulations E3 and E8).
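The failure mode of univariate tests on correlation-only shifts can be reproduced in a few lines: permuting one feature's values across query samples (as in manipulation E3) leaves its marginal untouched, so a per-feature two-sample KS test sees nothing, even though the feature correlation collapses. This is a minimal synthetic illustration, not one of the benchmarked methods.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Two strongly correlated features; reference and query drawn from the same joint.
z = rng.normal(size=4000)
ref = np.stack([z[:2000], z[:2000] + 0.1 * rng.normal(size=2000)], axis=1)
qry = np.stack([z[2000:], z[2000:] + 0.1 * rng.normal(size=2000)], axis=1)

# E3-style shift: permute feature 1 across query samples.
# Its marginal distribution is unchanged; its joint with feature 0 is destroyed.
qry[:, 1] = rng.permutation(qry[:, 1])

# Univariate two-sample KS tests per feature detect no marginal shift...
pvals = [ks_2samp(ref[:, j], qry[:, j]).pvalue for j in range(2)]

# ...but the feature correlation collapses in the query set.
corr_ref = np.corrcoef(ref[:, 0], ref[:, 1])[0, 1]
corr_qry = np.corrcoef(qry[:, 0], qry[:, 1])[0, 1]
```

Detecting such shifts requires comparing joint (or conditional) structure, which is what the multivariate methods below do.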
In contrast, techniques based on conditional testing (MB-SM, MB-KS, KNN-KS, and Deep-SM), along with DataFix and FSL-Net, successfully identify these more complex manipulations. FSL-Net shows a clear advantage in high-dimensional datasets (Phenotypes, Founders, and Canine), outperforming all competing methods and highlighting its effectiveness in handling large feature sets.

Ablation Analysis. We assess the impact of each component of FSL-Net's Statistical Descriptor Network by training models with different combinations of its three components: Statistical Measures (SM), Moment Extraction Network (ME), and Neural Embedding Network (NE). All variants include the Prediction Network (PN), except for the SM-only baseline, where predictions are obtained by directly thresholding the statistical measures. Table 4 presents the mean F-1 score achieved by each configuration. Training lasted three days for all variants, except for the full model, which was trained for an extended period of seven days. Using the Prediction Network in the SM-only baseline greatly improves feature shift localization performance compared to simply thresholding the statistical measures (from an F-1 score of 0.307 to 0.710), emphasizing the crucial importance of the Prediction Network. Each component contributes to performance improvements, suggesting that each has unique value, with extended training yielding additional gains.

Table 4. Mean F-1 scores for various configurations of FSL-Net.
3-Day training: SM 0.307; SM + PN 0.710; ME + PN 0.742; NE + PN 0.783; ME + NE + PN 0.770; SM + ME + PN 0.855; SM + NE + PN 0.878; SM + ME + NE + PN 0.889. 7-Day training: SM + ME + NE + PN 0.894.

Additional experimental results are presented in Section F, including: (1) a median-based evaluation of feature shift localization performance and runtime across different methods; (2) a detailed performance and efficiency comparison between FSL-Net and DataFix; (3) extended evaluations of FSL-Net and DataFix on high-dimensional image datasets (CIFAR10 and COIL-100); (4) an analysis of FSL-Net's runtime improvement over DataFix; (5) an evaluation of the threshold-based variant of the SM-only baseline; and (6) a qualitative evaluation of FSL-Net on the MNIST dataset.

5. Conclusions

Current feature shift localization methods involve a tradeoff between speed and accuracy when dealing with large data volumes. In this work, we introduced FSL-Net, a novel equivariant neural network that matches or surpasses existing state-of-the-art methods while offering a significant reduction in processing times. FSL-Net leverages neurally learned statistical descriptors, augmented with traditional statistical measures, to effectively capture the input distributions. By contrasting these descriptors across different data sources, the network accurately detects both univariate and multivariate shifts. Most importantly, FSL-Net is designed to handle datasets of varying sizes and to generalize across new distributions without the need for model re-training with each new estimation, providing significant speed and scalability advantages over competing methods.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Bahassine, S., Madani, A., Al-Sarem, M., and Kissi, M.
Feature selection using an improved chi-square for Arabic text classification. Journal of King Saud University - Computer and Information Sciences, 32(2):225–231, 2020.

Barchard, K. A. and Pace, L. A. Preventing human error: The impact of data entry methods on data accuracy and statistical results. Computers in Human Behavior, 27(5):1834–1839, 2011.

Barrabés, M., Bonet, D., Moriano, V. N., Giró-i-Nieto, X., Montserrat, D. M., and Ioannidis, A. G. Genomic databases homogenization with machine learning. In 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2952–2959. IEEE, 2023.

Barrabés, M., Mas Montserrat, D., Geleta, M., Giró-i-Nieto, X., and Ioannidis, A. Adversarial learning for feature shift detection and correction. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 57597–57638. Curran Associates, Inc., 2023.

Barrabés, M., Perera, M., Moriano, V. N., Giró-i-Nieto, X., Montserrat, D. M., and Ioannidis, A. G. Advances in biomedical missing data imputation: A survey. IEEE Access, pp. 1–1, 2024. doi: 10.1109/ACCESS.2024.3516506.

Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550, 1994.

Batzner, S., Musaelian, A., Sun, L., Geiger, M., Mailoa, J. P., Kornbluth, M., Molinari, N., Smidt, T. E., and Kozinsky, B. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature Communications, 13(1):2453, 2022.

Benton, G., Finzi, M., Izmailov, P., and Wilson, A. G. Learning invariances in neural networks from training data. Advances in Neural Information Processing Systems, 33:17605–17616, 2020.

Blake, C. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.

Boecking, B., Neiswanger, W., Xing, E., and Dubrawski, A.
Interactive weak supervision: Learning useful heuristics for data labeling. arXiv preprint arXiv:2012.06046, 2020.

Boutell, M. R., Luo, J., Shen, X., and Brown, C. M. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.

Buckley, T., Ghosh, B., and Pakrashi, V. A feature extraction & selection benchmark for structural health monitoring. Structural Health Monitoring, 22(3):2082–2127, 2023.

Candanedo, L. M., Feldheim, V., and Deramaix, D. Data driven prediction models of energy use of appliances in a low-energy house. Energy and Buildings, 140:81–97, 2017.

Caporali, A., Pantano, M., Janisch, L., Regulin, D., Palli, G., and Lee, D. A weakly supervised semi-automatic image labeling approach for deformable linear objects. IEEE Robotics and Automation Letters, 8(2):1013–1020, 2023.

Cheng, Q., Varshney, P. K., and Arora, M. K. Logistic regression for feature selection and soft classification of remote sensing data. IEEE Geoscience and Remote Sensing Letters, 3(4):491–494, 2006.

Costanzo, L. Data cleaning during the research data management process. Research Data Management in the Canadian Context, 2023.

Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2016.

Dai, T., Hu, H., Wan, Y., Chen, Q., and Wang, Y. A data quality management and control framework and model for health decision support. In 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 1792–1796, 2015. doi: 10.1109/FSKD.2015.7382218.

Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

Ding, C. and Peng, H. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(02):185–205, 2005.

Elssied, N. O. F., Ibrahim, O., and Osman, A. H.
A novel feature selection based on one-way ANOVA F-test for email spam classification. Research Journal of Applied Sciences, Engineering and Technology, 7(3):625–638, 2014.

Fleuret, F. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5(9), 2004.

Force, C. C. T. United States COVID-19 cases and deaths by state over time. 2022.

Galichon, A. Optimal transport methods in economics. Princeton University Press, 2018.

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1–37, 2014.

Geleta, M., Mas Montserrat, D., Giró-i-Nieto, X., and Ioannidis, A. G. Deep variational autoencoders for population genetics. bioRxiv, pp. 2023–09, 2023.

Ghosh, E. and Kaviraj, T. Data collection: Best practices and techniques of footprint collection for newborn identification. International Journal of Creative Research and Thoughts (IJCRT), 11(9), 2023.

Ginart, T., Zhang, M. J., and Zou, J. MLDemon: Deployment monitoring for machine learning systems. In International Conference on Artificial Intelligence and Statistics, pp. 3962–3997. PMLR, 2022.

Goldfeld, Z., Kato, K., Rioux, G., and Sadhu, R. Statistical inference with regularized optimal transport. Information and Inference: A Journal of the IMA, 13(1):iaad056, 2024.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

Hopf, K. and Reifenrath, S. Filter methods for feature selection in supervised machine learning applications: Review and benchmark. arXiv preprint arXiv:2111.12140, 2021.

Huang, X., Zhang, L., Wang, B., Li, F., and Zhang, Z. Feature clustering based support vector machine recursive feature elimination for gene selection. Applied Intelligence, 48(3):594–607, 2018.

Huerta, R., Mosqueiro, T., Fonollosa, J., Rulkov, N.
F., and Rodriguez-Lujan, I. Online decorrelation of humidity and temperature in chemical sensors for continuous monitoring. Chemometrics and Intelligent Laboratory Systems, 157:169–176, 2016.

Izquierdo, S. and Civera, J. Optimal transport aggregation for visual place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17658–17668, 2024.

Kantorovitch, L. On the translocation of masses. Management Science, 5(1):1–4, 1958.

Krishnan, S. and Wu, E. AlphaClean: Automatic generation of data cleaning pipelines. arXiv preprint arXiv:1904.11827, 2019.

Kulinski, S., Bagchi, S., and Inouye, D. I. Feature shift detection: Localizing which features have shifted via conditional distribution tests. Advances in Neural Information Processing Systems, 33:19523–19533, 2020.

Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., and Liu, H. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):94, 2018.

Li, Y., Swersky, K., and Zemel, R. Generative moment matching networks. In International Conference on Machine Learning, pp. 1718–1727. PMLR, 2015.

Lim, L.-H. and Nelson, B. J. What is an equivariant neural network? arXiv preprint arXiv:2205.07362, pp. 5987–6001, 2022.

Lim, S. B., Tan, S. J., Lim, W.-T., and Lim, C. T. A merged lung cancer transcriptome dataset for clinical predictive modeling. Scientific Data, 5(1):1–8, 2018.

Liu, A., Lu, J., and Zhang, G. Diverse instance-weighting ensemble based on region drift disagreement for concept drift adaptation. IEEE Transactions on Neural Networks and Learning Systems, 32(1):293–307, 2020.

Liu, L., Liu, J., and Han, J. Multi-head or single-head? An empirical comparison for transformer training. arXiv preprint arXiv:2106.09650, 2021.

Losing, V., Hammer, B., and Wersing, H. KNN classifier with self adjusting memory for heterogeneous concept drift. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 291–300. IEEE, 2016.
Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., and Zhang, G. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12):2346–2363, 2018.

Lu, N., Lu, J., Zhang, G., and De Mantaras, R. L. A concept drift-tolerant case-base editing technique. Artificial Intelligence, 230:108–133, 2016.

Maldonado, S. and Weber, R. A wrapper method for feature selection using support vector machines. Information Sciences, 179(13):2208–2217, 2009.

Montesuma, E. F., Mboula, F. M. N., and Souloumiac, A. Recent advances in optimal transport for machine learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

Montserrat, D. M., Lin, Q., Allebach, J., and Delp, E. J. Training object detection and recognition CNN models using data augmentation. Electronic Imaging, 2017(10):27–36, 2017.

Moreno-Grau, S., Vernekar, M., Lopez-Pineda, A., Mas Montserrat, D., Barrabés, M., Quinto-Cortés, C. D., Moatamed, B., Lee, M. T. M., Yu, Z., Numakura, K., et al. Polygenic risk score portability for common diseases across genetically diverse populations. Human Genomics, 18(1):93, 2024.

Mustaqeem, A., Anwar, S. M., Majid, M., and Khan, A. R. Wrapper method for feature selection to classify cardiac arrhythmia. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 3656–3659. IEEE, 2017.

Nadal-Roig, E. and Plà-Aragonès, L. M. Optimal transport planning for the supply to a fruit logistic centre. Handbook of Operations Research in Agriculture and the Agri-Food Industry, pp. 163–177, 2015.

Nasir, I. M., Khan, M. A., Yasmin, M., Shah, J. H., Gabryel, M., Scherer, R., and Damaševičius, R. Pearson correlation-based feature selection for document classification using balanced training. Sensors, 20(23):6793, 2020.

Pan, J., Pham, V., Dorairaj, M., Chen, H., and Lee, J.-Y.
Adversarial validation approach to concept drift problem in user targeting automation systems at Uber. arXiv preprint arXiv:2004.03045, 2020.

Perera, M., Montserrat, D. M., Barrabés, M., Geleta, M., Giró-i-Nieto, X., and Ioannidis, A. G. Generative moment matching networks for genotype simulation. In 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 1379–1383. IEEE, 2022.

Peyré, G., Cuturi, M., et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Piano, L., Garcea, F., Gatteschi, V., Lamberti, F., and Morra, L. Detecting drift in deep learning: A methodology primer. IT Professional, 24(5):53–60, 2022.

Qian, J., Tanigawa, Y., Du, W., Aguirre, M., Chang, C., Tibshirani, R., Rivas, M. A., and Hastie, T. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genetics, 16(10):e1009141, 2020.

Qian, J., Song, Z., Yao, Y., Zhu, Z., and Zhang, X. A review on autoencoder based representation learning for fault detection and diagnosis in industrial processes. Chemometrics and Intelligent Laboratory Systems, pp. 104711, 2022.

Quanrud, K. Approximating optimal transport with linear programs. arXiv preprint arXiv:1810.05957, 2018.

Rabanser, S., Günnemann, S., and Lipton, Z. Failing loudly: An empirical study of methods for detecting dataset shift. Advances in Neural Information Processing Systems, 32, 2019.

Ruhe, D., Brandstetter, J., and Forré, P. Clifford group equivariant neural networks. Advances in Neural Information Processing Systems, 36, 2024.

Sanjabi, M., Ba, J., Razaviyayn, M., and Lee, J. D. On the convergence and robustness of training GANs with regularized optimal transport. Advances in Neural Information Processing Systems, 31, 2018.

Sinaci, A. A., Gencturk, M., Teoman, H. A., Laleci Erturkmen, G.
B., Alvarez-Romero, C., Martinez-Garcia, A., Poblador-Plou, B., Carmona-Pírez, J., Löbe, M., and Parra-Calderon, C. L. A data transformation methodology to create findable, accessible, interoperable, and reusable health data: Software design, development, and evaluation study. Journal of Medical Internet Research, 25:e42822, 2023.

Subasri, V., Krishnan, A., Dhalla, A., Pandya, D., Malkin, D., Razak, F., Verma, A., Goldenberg, A., and Dolatabadi, E. Diagnosing and remediating harmful data shifts for the responsible deployment of clinical AI models. medRxiv, pp. 2023–03, 2023.

Sylvester, E. V., Bentzen, P., Bradbury, I. R., Clément, M., Pearce, J., Horne, J., and Beiko, R. G. Applications of random forest feature selection for fine-scale genetic population assignment. Evolutionary Applications, 11(2):153–165, 2018.

Tran, B., Zhang, M., and Xue, B. A PSO based hybrid feature selection algorithm for high-dimensional classification. In 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 3801–3808. IEEE, 2016.

Van Rijn, J. N., Bischl, B., Torgo, L., Gao, B., Umaashankar, V., Fischer, S., Winter, P., Wiswedel, B., Berthold, M. R., and Vanschoren, J. OpenML: A collaborative science platform. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13, pp. 645–649. Springer, 2013.

Vanschoren, J., Van Rijn, J. N., Bischl, B., and Torgo, L. OpenML: Networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.

Xiaojuan, L. and Yu, Z. A data integration tool for the integrated modeling and analysis for EAST. Fusion Engineering and Design, 195:113933, 2023.

Yu, S., Wang, X., and Príncipe, J. C. Request-and-reverify: Hierarchical hypothesis testing for concept drift detection with expensive labels. arXiv preprint arXiv:1806.10131, 2018.
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.

Zha, D., Bhat, Z. P., Lai, K.-H., Yang, F., Jiang, Z., Zhong, S., and Hu, X. Data-centric artificial intelligence: A survey. arXiv preprint arXiv:2303.10158, 2023.

Zheng, L., Yuan, J., Wang, C., and Kong, L. Efficient attention via control variates. arXiv preprint arXiv:2302.04542, 2023.

A. Benchmarking Methods

DataFix detects and localizes feature shifts using an iterative adversarial approach called DF-Locate. At each iteration, a random forest classifier is trained to distinguish between samples from a reference and a query distribution. The predicted class probabilities from this discriminator are used to compute the total variation distance (TVD), and its feature importance scores, based on the mean decrease of impurity, are used to locate the features originating the shift. These features are removed in successive rounds until the estimated divergence falls below a threshold or until half the features have been eliminated. A final refinement step uses a knee-detection algorithm to choose the optimal stopping point in the removal process. The selection of features is controlled by a dynamic threshold defined as the product of the TVD and a hyperparameter τ.

Classical univariate statistical filters include Mutual Information (MI) and SelectKBest. These methods rank features based on their statistical association with the target. MI measures the entropy shared between each feature and the target, while SelectKBest applies the ANOVA F-test for continuous features and the Chi-square test for categorical ones. The ANOVA F-test checks for mean differences in continuous features across target classes, while the Chi-square test evaluates categorical feature-target dependency by comparing observed frequencies across target classes.
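The discriminator-based TVD estimate at the core of DF-Locate can be sketched with scikit-learn; this is a minimal illustration using a held-out split and the plug-in bound TVD ≥ 2·accuracy − 1, not the released DataFix implementation, whose iterative removal loop and knee-detection step are omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def discriminator_tvd(ref, qry, seed=0):
    """Train a reference-vs-query discriminator, estimate the TVD from its
    held-out accuracy, and return its impurity-based feature importances,
    which point at the features originating the shift."""
    X = np.vstack([ref, qry])
    y = np.concatenate([np.zeros(len(ref)), np.ones(len(qry))])
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(Xtr, ytr)
    # Held-out accuracy acc gives the plug-in lower bound TVD >= 2*acc - 1.
    tvd = max(0.0, 2.0 * clf.score(Xte, yte) - 1.0)
    return tvd, clf.feature_importances_

rng = np.random.default_rng(0)
ref = rng.normal(size=(600, 8))
qry = rng.normal(size=(600, 8))
qry[:, 3] += 2.0                    # shift a single feature of the query set
tvd, importances = discriminator_tvd(ref, qry)
```

In DF-Locate this estimate drives the loop: top-importance features are removed and the discriminator is retrained until the estimated divergence falls below the dynamic threshold.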
More advanced iterative selection methods include Minimum Redundancy Maximum Relevance (MRMR) and Fast Conditional Mutual Information Maximization (FAST-CMIM). These methods attempt to balance feature relevance with redundancy by considering mutual information conditioned on already selected features. MRMR iteratively selects relevant features while minimizing redundancy with the features selected so far. FAST-CMIM iteratively selects features that maximize MI conditioned on previously selected features. These methods, while more expressive than univariate tests, suffer from poor scalability and cannot process large datasets within practical time constraints.

Other benchmarked methods include model-based and statistical testing techniques specifically designed for feature shift localization. These include MB-SM (multivariate Gaussian with Fisher-divergence test), MB-KS (multivariate Gaussian with KS test), KNN-KS (K-Nearest Neighbors with KS statistic), and Deep-SM (deep density neural models with Fisher-divergence test). These approaches operate under stronger assumptions about data structure and require prior knowledge of the number of corrupted features, which is rarely available in real-world applications.

B. Training and Validation Datasets

B.1. OpenML Datasets

We source a diverse collection of datasets from OpenML (Van Rijn et al., 2013), selecting only those in a tabular format with at least 10 features and a minimum of 500 samples. To ensure data quality, we preprocess the datasets by removing constant features and addressing missing values using two strategies. In the first strategy, we remove features with more than 40% missing values, followed by the elimination of samples with missing data. In the second strategy, we remove features with more than 70% missing values before discarding samples with missing data. If both strategies preserve the required dimensions, we randomly select one for application.
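The first missing-value strategy above (drop features with more than 40% missing values, then drop any remaining samples with missing data) can be sketched as follows; the helper name is our own, and missing entries are assumed to be encoded as NaN.

```python
import numpy as np

def clean_missing(X, feature_thresh=0.4):
    """Drop features whose fraction of missing values exceeds `feature_thresh`,
    then drop remaining samples that still contain missing values.
    Illustrative sketch of the B.1 preprocessing strategy."""
    keep_features = np.isnan(X).mean(axis=0) <= feature_thresh
    X = X[:, keep_features]
    keep_samples = ~np.isnan(X).any(axis=1)
    return X[keep_samples]

# Column 1 is entirely missing (dropped as a feature); row 1 then still
# contains a missing value in column 2 and is dropped as a sample.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, np.nan, np.nan],
              [7.0, np.nan, 9.0],
              [1.0, np.nan, 2.0]])
cleaned = clean_missing(X)
```

The second strategy is identical with `feature_thresh=0.7`, trading more retained features for fewer retained samples.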
We also carefully remove any OpenML dataset that overlaps with the test datasets. For efficiency, large datasets exceeding 40M cell values are partitioned into ten equal-sized sets to accelerate processing. As a result, we obtain a total of 1,032 cleaned and partitioned datasets from OpenML.

B.2. Algebraic Simulated Datasets

We generate 184 simulated datasets using Polynomial, Sine, and Logarithmic functions, each containing 5,000 samples and 1,000 features. Feature values from Polynomial functions are derived from f(x) = ax^4 + bx^3 + cx^2 + dx + e, with parameters a, b, c, d, and e uniformly sampled from [-50, 50], and the degree randomly set to either 3 or 4. Sine functions follow f(x) = a cos(bx + c), where parameter a is sampled from [-50, 50], and b and c are sampled from [-π, π]. Logarithmic functions are defined as log_a(x + 6), with base a sampled from [2, 10]. The input values, x, are drawn from the range [-5, 5]. Function parameters remain fixed across all samples.

B.3. Probabilistic Simulated Datasets

We generate 184 simulated datasets based on probabilistic distributions, including Gaussian, Bernoulli, and Beta mixture models. Each dataset consists of 5,000 samples and 1,000 features. To construct these datasets, we begin by selecting a base distribution D from {Gaussian, Bernoulli, Beta} and initializing parameters for a mixture model with a randomly determined number of components K, where 1 ≤ K ≤ 100. For the Gaussian distribution, the mean vector µ_k = (µ_k1, µ_k2, ..., µ_kd) is drawn from a standard normal distribution N(0, 1), while the variance vector σ²_k = (σ²_k1, σ²_k2, ..., σ²_kd) is sampled from a uniform distribution U(0.1, 1.1) for each feature independently. For the Bernoulli distribution, the probability vector p_k = (p_k1, p_k2, ..., p_kd), which represents the probability of success for each feature, is sampled from a uniform distribution U(0, 1).
For the Beta distribution, the parameters α_k = (α_{k1}, α_{k2}, ..., α_{kd}) and β_k = (β_{k1}, β_{k2}, ..., β_{kd}) are drawn from a uniform distribution U(1, 2). Samples are generated from the weighted mixture of these components, defined as ∑_{k=1}^{K} π_k D(θ_k), where π_k represents the mixing coefficients, satisfying ∑_{k=1}^{K} π_k = 1.

Each dataset has a 25% chance of undergoing multiple transformations, with the number of transformation steps I randomly chosen between 1 and 5. These transformations include normalization (standard or min-max scaling), followed by a linear transformation in which the dataset is multiplied by a randomly sampled matrix W ∈ R^{d×d}. The entries of W are drawn from one of four distributions: U(0, 1), N(0, 1), Beta(1, 1), or Bernoulli(0.5). Additionally, a non-linear transformation is applied, randomly chosen from the ReLU, GELU, sigmoid, hyperbolic tangent, or logarithmic transformations. To preserve structural information, residual connections are sometimes applied, where the transformed data is combined with its previous or initial state, either directly or after undergoing normalization.

C. Testing Manipulation

Table 5 outlines the manipulation types used to induce shifts during evaluation. Some distort marginal distributions (E1, E2, E4, E5, E6, and E7), with manipulation E4 leaving the mean approximately unchanged. Others shuffle feature values across samples, altering feature correlations but not marginal distributions (E3 and E8). Manipulations E9 and E10 use KNN predictions to replace corrupted continuous and categorical features, respectively. For a more detailed explanation, see (Barrabés et al., 2023).

Table 5. Manipulation types applied to continuous and/or categorical features during benchmarking (table from (Barrabés et al., 2023)).

Type | Mapping | Description | Shift | Data
E1 | x ← Uniform(0, 1) | Each value is substituted by a random number between 0 and 1. | p_i ≠ q_i | Cont.
E2 | 1 − x | Each value is negated. | p_i ≠ q_i, E[q_i] = 1 − E[p_i] | Both
E3 | P_i X_i | P_i is a random permutation matrix applied to feature i. | p_i = q_i, p_C ≠ q_C, q_C = ∏_{i∈C} q_i | Both
E4.1–4.3 | clamp_{0,1}(x + ασ), σ ~ Rademacher(0.5) | Add constant noise with a random sign; α ∈ {0.02, 0.05, 0.1} for 4.1–4.3, respectively. | p_i ≠ q_i, E[p_i] ≈ E[q_i] | Cont.
E5 | round(x) | Values are binarized. | p_i ≠ q_i | Cont.
E6.1–6.3 | b(1 − x) + (1 − b)x, b ~ Bernoulli(ρ) | Values are negated with probability ρ ∈ {0.2, 0.4, 0.6} for 6.1–6.3, respectively. | p_i ≠ q_i, E[q_i] = ρ + (1 − 2ρ)E[p_i] | Cat.
E7 | MLP(x) | Forward pass through an MLP with min-max normalization or binarization. | p_i ≠ q_i | Both
E8 | P X | P is a random permutation matrix applied to all features simultaneously. | p_i = q_i, p_C = q_C, p ≠ q | Both
E9 | KNN(x) | Predict feature with KNN (Regressor). | – | Cont.
E10 | KNN(x) | Predict feature with KNN (Classifier). | – | Cat.

D. Exploration of Attention Mechanisms

We investigated the integration of attention mechanisms across features within the residual blocks of both the Neural Embedding Network and the Prediction Network to effectively capture diverse data patterns and feature correlations. Upon incorporation, the attention mechanisms were coupled with MLPs implemented using convolutional layers with unit-sized kernels.

D.1. Efficient EVA Attention Layers

Using full attention across features was infeasible for high-dimensional datasets due to its quadratic computational requirements. Consequently, we investigated the use of EVA attention layers (Zheng et al., 2023). Attention layers without positional embeddings are inherently equivariant; therefore, we did not include positional embeddings. However, efficient (approximate) attention layers, such as EVA, can fail to preserve equivariance under certain conditions.

D.2. Sequence Handling and Window Size Adjustment

Key considerations involve ensuring that the window size of the EVA layer is not only smaller than the sequence length but also divides it evenly.
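One common way to satisfy this divisibility constraint is to pad the feature sequence up to the next multiple of the window size and mark the padded positions with a key padding mask so the attention ignores them. The sketch below (numpy; the function name and array layout are illustrative, not the paper's implementation) shows the idea.

```python
import numpy as np

def pad_to_window_multiple(x, window_size):
    """Pad a (batch, seq_len, dim) array along the sequence axis so that
    seq_len becomes divisible by `window_size`. Returns the padded array and
    a key padding mask where True marks padded (to-be-ignored) positions."""
    b, n, d = x.shape
    remainder = n % window_size
    pad = 0 if remainder == 0 else window_size - remainder
    x_padded = np.pad(x, ((0, 0), (0, pad), (0, 0)))  # zero-pad sequence axis
    mask = np.zeros((b, n + pad), dtype=bool)
    mask[:, n:] = True                                # padded positions masked out
    return x_padded, mask
```

Attention frameworks typically accept such a boolean mask directly (e.g., a `key_padding_mask` argument), which keeps the padding from contributing to the attention scores.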
Managing variable-length sequences requires meticulous data processing, achieved through padding and the application of a key padding mask for efficient sequence handling. The window size w of the EVA attention was dynamically adjusted based on the sequence length to optimize processing efficiency and effectiveness, determined as a min-max clipped function of the feature dimension d with a lower bound of 8.

D.3. Limitations

Despite EVA attention layers being more efficient than traditional attention mechanisms (Liu et al., 2021), their integration into both the Neural Embedding Network and the Prediction Network resulted in CUDA out-of-memory issues for certain configurations. Consequently, we attempted to incorporate attention solely within the Prediction Network. While this approach alleviated memory issues, the performance remained below that of our baseline configuration without any attention mechanisms.

E. Hyperparameter Tuning

E.1. Training Setup and Hardware Specifications

We conducted a random search across various hyperparameters of the FSL-Net architecture and its training strategy. Training for each network was terminated if the validation loss did not improve for 50 consecutive evaluation intervals, with the maximum training duration limited to three days. To expedite the training process, a single NVIDIA GPU with 32 GB of memory was used. The networks were validated on 50 validation datasets every 2,500 training iterations. The model checkpoint achieving the highest validation accuracy was selected as the optimal network and subsequently trained for up to seven days. This best-performing model was employed for inference and evaluation in the benchmarking experiments.

E.2.
Alternative Statistical Measures

In addition to the statistical measures delineated in the main text, we investigated the following metrics: skewness (indicating asymmetry), kurtosis (reflecting tail heaviness), index of dispersion (representing the variance-to-mean ratio), and trimmed mean deviation (a robust estimator of central tendency that excludes outliers). The formulas for each measure are presented in Table 6. These measures were ultimately excluded from the final analysis, as they did not confer any discernible advantage.

Table 6. Formulas for additional statistical measures of the Statistical Descriptor Network. N is the number of observations in X, x_j represents the ith dimension of the jth sample (we omit the subscript i for brevity), x_{(l)} is the ith dimension of the lth sample after sorting the samples from smallest to largest, r is the count of observations trimmed from each end in the trimmed mean deviation, and x̄ and σ are the empirical mean and standard deviation of the ith dimension.

Statistical Measure | Linear | Equation µ_{i,k}
Skewness | Yes | (1/N) ∑_{j=1}^{N} ((x_j − x̄)/σ)³
Kurtosis | Yes | (1/N) ∑_{j=1}^{N} ((x_j − x̄)/σ)⁴
Index of Dispersion | Yes | σ²/(x̄ + ε)
Trimmed Mean Deviation | No | (1/(N − 2r)) ∑_{j=r+1}^{N−r} x_{(j)}

E.3. Alternative Merging Operations

In addition to the normalized squared difference discussed in the main text, we examined several alternative merging operations within the Prediction Network. The following methods were evaluated; however, none yielded performance improvements over the existing approach: concatenation, α(µ_p, µ_q) = [µ_p; µ_q]; element-wise difference, α(µ_p, µ_q) = µ_p − µ_q; and squared difference, α(µ_p, µ_q) = (µ_p − µ_q)². Note that, for the concatenation method, the merged feature vector has dimensions µ_{p,q} ∈ R^{d×2t}, whereas for the other merging operations, it has dimensions µ_{p,q} ∈ R^{d×t}.

E.4. Hyperparameter Search

E.4.1. NETWORK TUNING PARAMETERS

Table 7 provides an overview of the search spaces and the optimal values determined for each network-related hyperparameter in FSL-Net.
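The four additional measures of Table 6 can be computed per feature as in the numpy sketch below. The function name, the ε stabilizer value, and the trimming convention (r = trim fraction × N observations removed from each end) are assumptions for illustration.

```python
import numpy as np

def extra_statistics(X, trim_frac=0.1, eps=1e-8):
    """Per-feature skewness, kurtosis, index of dispersion, and trimmed mean
    for an (N, d) data matrix, following the formulas in Table 6 (sketch)."""
    N = X.shape[0]
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    z = (X - mean) / (std + eps)            # standardized values
    skewness = (z ** 3).mean(axis=0)        # third standardized moment
    kurtosis = (z ** 4).mean(axis=0)        # fourth standardized moment
    dispersion = std ** 2 / (mean + eps)    # variance-to-mean ratio
    r = int(trim_frac * N)                  # observations trimmed from each end
    X_sorted = np.sort(X, axis=0)
    trimmed_mean = X_sorted[r:N - r].mean(axis=0)
    return skewness, kurtosis, dispersion, trimmed_mean
```

For standard normal features, this yields skewness near 0 and kurtosis near 3, a quick sanity check on the implementation.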
These parameters encompass configurations for the statistical measures, Moment Extraction Network, Neural Embedding Network, and Prediction Network. For parameters not explicitly specified, default values were applied.

E.4.2. ATTENTION TUNING PARAMETERS

Table 8 presents the hyperparameters of the attention mechanism explored during the optimization process. These parameters regulate the EVA attention layers within FSL-Net, including the number of heads, the number of landmarks, the window factor, and dropout rates.

E.4.3. OPTIMIZATION TUNING PARAMETERS

Table 9 outlines the search space and optimal values for optimization hyperparameters in FSL-Net's training strategy. This includes the loss function and Adam optimizer settings.

Table 7. Search space and optimal values for tuned network hyperparameters in FSL-Net.

Component | Hyperparameter | Possible Values | Optimal Value
Statistical Measures | Mean | {True, False} | True
Statistical Measures | Standard Deviation | {True, False} | True
Statistical Measures | Median | {True, False} | True
Statistical Measures | Mean Absolute Deviation | {True, False} | True
Statistical Measures | p-order Moments | {True, False} | True
Statistical Measures | p | {2}, {3}, {2, 3} | {2, 3}
Statistical Measures | Histogram | {True, False} | True
Statistical Measures | Empirical CDF | {True, False} | True
Statistical Measures | # Bins | {100}, {50, 100} | 100
Statistical Measures | Skewness | {True, False} | False
Statistical Measures | Kurtosis | {True, False} | False
Statistical Measures | Index of Dispersion | {True, False} | False
Statistical Measures | Trimmed Mean Deviation | {True, False} | False
Statistical Measures | Trimmed Percentage | {0.1} | 0.1
Moment Extraction Network | # Kernels | {32, 64, 128} | 64
Moment Extraction Network | Kernel Size | {75, 125} | 75
Moment Extraction Network | Dilation | {1} | 1
Moment Extraction Network | Activation | {ReLU, Tanh} | ReLU
Neural Embedding Network | # Residual Layers | {3, 5, 7} | 5
Neural Embedding Network | # Kernels | {32, 64} | 64
Neural Embedding Network | Kernel Size | {5, 7} | 5
Neural Embedding Network | Dilation | {1} | 1
Neural Embedding Network | Activation | {Tanh, GELU} | Tanh
Prediction Network | # Residual Layers | {3, 5, 7} | 7
Prediction Network | # Kernels | {32, 64} | 64
Prediction Network | Kernel Size | {5, 7} | 5
Prediction Network | Dilation | {1} | 1
Prediction Network | Activation | {Tanh, GELU} | Tanh
Prediction Network | Combination | {Concatenation, Element-wise difference, Squared difference, Normalized squared difference} | Normalized squared difference

Table 8. Search space and optimal values for tuned attention hyperparameters.
Component | Hyperparameter | Possible Values | Optimal Value
Attention | # Heads | {4, 8} | 4
Attention | # Landmarks | {8} | 8
Attention | Window factor | {4} | 4
Attention | Window size | {7, 9} | 7
Attention | Overlapping windows | {True, False} | False
Attention | Attention dropout | {0, 0.3, 0.5} | 0
Attention | Projection dropout | {0} | 0

Table 9. Search space and optimal values for optimization hyperparameters in the training strategy.

Component | Hyperparameter | Possible Values | Optimal Value
Loss Function | λ | {0.0001, 0.001} | 0.001
Adam Optimizer | Learning Rate | {0.0001, 0.001, 0.01} | 0.001
Adam Optimizer | Learning Rate Gamma | {0.9, 0.999, 0.9995} | 0.9995

F. Extended Experimental Results

F.1. Median-based Feature Shift Localization Performance and Runtime Comparison

Figure 3 presents a median-based analysis of feature shift localization performance and runtime, complementing the mean-based evaluation in the main text. For each method, F-1 scores and runtimes are first averaged across manipulation types and fractions of manipulated features, then aggregated using the median across datasets. This approach eliminates the need for imputation in cases where slower methods fail to complete for some datasets, enabling fairer comparisons across a broader set of methods. However, the mean-based results remain more representative for consistently successful approaches like FSL-Net and DataFix, which completed all evaluations. Figure 3a summarizes the resulting median F-1 scores, while Figure 3b and Figure 3c visualize the trade-off between performance and computational cost, plotting median F-1 scores against median and maximum runtime, respectively. In these plots, higher positions indicate better localization accuracy, while positions further to the left denote lower runtime. These comparisons again confirm that FSL-Net matches DataFix in localization accuracy while offering a substantial advantage in runtime efficiency. Despite being only slightly slower than SelectKBest, FSL-Net far exceeds this method in accuracy.
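The two-stage aggregation used for Figure 3 (mean over manipulation types and fractions within each dataset, then median across datasets) can be sketched as below. This is an illustrative numpy sketch; the (datasets × manipulations × fractions) array layout is an assumption, and NaN marks runs that a method failed to complete.

```python
import numpy as np

def median_based_score(f1):
    """Aggregate an (n_datasets, n_manipulations, n_fractions) array of F-1
    scores: mean over manipulation types and fractions per dataset, then the
    median across datasets. Datasets where a method never completed (all NaN)
    are simply dropped by the nan-aware median, so no imputation is needed."""
    per_dataset = np.nanmean(f1, axis=(1, 2))   # one score per dataset
    return np.nanmedian(per_dataset)            # robust aggregate across datasets
```

Dropping NaN datasets in the final median is what allows slower methods with missing runs to be compared without imputing values, as described above.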
Meanwhile, methods such as MRMR and FAST-CMIM continue to struggle with scalability and performance under median-based evaluation. MB-SM, MB-KS, KNN-KS, and Deep-SM show higher F-1 scores when evaluated by the median rather than the mean, with KNN-KS performing the best among them. However, all of these methods still lag behind FSL-Net, despite benefiting from access to the ground truth |C| (a condition rarely met in practice), and suffer from extremely poor scalability on large datasets. Figure 3d plots the mean runtime against the product of sample and feature sizes for each dataset. FSL-Net maintains a significantly lower mean runtime than DataFix across datasets of varying dimensionality. Figure 3e presents median F-1 scores by manipulation type, while Figure 3f shows mean F-1 scores by dataset. FSL-Net matches or surpasses the performance of competing methods across nearly all manipulation types, with two exceptions: E9, where DataFix achieves higher performance, consistent with mean-based results, and E4, where both DataFix and KNN-KS exhibit superior accuracy. Notably, FSL-Net demonstrates a clear advantage on high-dimensional datasets such as Phenotypes, Founders, and Canine, outperforming all baselines and highlighting its effectiveness in scenarios involving large feature sets.

Figure 3. Performance and runtime comparison across feature shift localization methods: a) median F-1 scores across manipulation types, fractions of manipulated features, and datasets; b) median F-1 scores vs. median runtime; c) median F-1 scores vs. maximum runtime; d) mean runtime vs.
sample-feature size product per dataset; e) median F-1 scores by manipulation type; f) mean F-1 scores by dataset.

F.2. Performance and Computational Efficiency of FSL-Net and DataFix

Figure 4 displays the mean F-1 scores (top), mean runtimes (middle), and maximum runtimes (bottom) for both FSL-Net and DataFix, computed across various manipulation types and fractions of manipulated features. FSL-Net exhibits comparable feature shift localization performance to DataFix on the majority of datasets, with a notable advantage on large datasets such as Phenotypes, Founders, and Canine, while also demonstrating significantly greater computational efficiency. The pronounced disparity between the mean and maximum runtimes of DataFix on certain datasets suggests that its processing time is highly sensitive to the complexity of the shifts, potentially necessitating multiple iterations to achieve convergence. In contrast, FSL-Net executes only a single forward pass through the network, ensuring scalability to both high-dimensional and large datasets.

Figure 4. Mean F-1 scores (top), mean runtimes (middle), and maximum runtimes (bottom) of FSL-Net and DataFix by dataset.

F.3. Extended Evaluation of FSL-Net and DataFix on CIFAR10 and COIL-100

Table 10 presents an extended evaluation of FSL-Net and DataFix on two image datasets: CIFAR10 (10k samples) and COIL-100. We report the mean F-1 score in feature shift localization and the mean and maximum runtime (in hours), computed across manipulation types and fractions of manipulated features. FSL-Net achieves higher F-1 scores on average and consistently outperforms DataFix in terms of computational efficiency, exhibiting significantly lower runtimes in both average and worst-case scenarios.

Table 10.
Comparison of FSL-Net and DataFix averaged across manipulation types and fractions of manipulated features on two datasets: CIFAR10 (10k) and COIL-100.

Dataset | F-1 Score (FSL-Net / DataFix) | Mean Runtime, hours (FSL-Net / DataFix) | Max Runtime, hours (FSL-Net / DataFix)
CIFAR10 | 0.9565 / 0.8911 | 0.0319 / 0.2948 | 0.0336 / 0.8084
COIL-100 | 0.9720 / 0.9805 | 0.4906 / 2.9228 | 0.8406 / 9.1567

F.4. Runtime Improvement of FSL-Net over DataFix

Figure 5 presents the mean runtime improvement of FSL-Net over DataFix across datasets, sorted by increasing dataset size (measured as the product of the number of samples and the number of features). The speedup factor, defined as the ratio of DataFix's runtime to FSL-Net's, measures the performance gain, with higher values indicating greater efficiency. The results are averaged across manipulation types and fractions of manipulated features. On average, FSL-Net achieves a substantial speedup of 35.8× (indicated by the dashed line), consistently outperforming DataFix across datasets. Among all datasets, the high-dimensional Phenotypes dataset exhibits the greatest speedup, with FSL-Net outperforming DataFix by a remarkable 136.3×.

Figure 5. Runtime improvement of FSL-Net over DataFix across datasets, sorted by dataset size (number of samples × number of features). The speedup factor indicates how many times faster FSL-Net is compared to DataFix. The dashed line marks the average speedup of 35.8×.

F.5. Performance of the Threshold-Based Variant of the SM-only Baseline

We evaluate a simplified (though constrained) configuration of FSL-Net that eliminates the Prediction Network and instead relies solely on thresholding the combined statistical measures, computed as the normalized squared differences between the reference and query.
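Such an SM-only, threshold-based localizer can be sketched as follows. This is a minimal numpy sketch under stated assumptions: the exact normalization applied to the squared differences is not specified here, so the symmetric form below is illustrative, as are the function names.

```python
import numpy as np

def sm_only_scores(ref, query, reduce="mean", eps=1e-8):
    """Score each feature by the normalized squared difference between
    reference and query statistics of shape (d, t) (t statistical measures
    per feature), reduced to one scalar per feature by mean or max."""
    diff = (ref - query) ** 2 / (ref ** 2 + query ** 2 + eps)  # assumed normalization
    return diff.mean(axis=1) if reduce == "mean" else diff.max(axis=1)

def localize(ref, query, threshold=0.002, reduce="mean"):
    """Flag features whose score exceeds the threshold as shifted."""
    return sm_only_scores(ref, query, reduce) > threshold
```

With the best configuration reported above (mean reduction, threshold 0.002), this variant reaches only a 0.307 F-1 score, illustrating why the learned Prediction Network is needed on top of the raw statistical differences.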
In this threshold-based approach, the statistical measures are reduced to a scalar score by taking either the mean or the maximum of these differences. Figure 6 shows the mean F-1 scores for the SM-only variant of FSL-Net, where feature shift localization is performed entirely through thresholding. The F-1 scores are averaged across datasets, manipulation types, and fractions of manipulated features. The best performance is achieved using the mean-based method with a threshold of 0.002, yielding an optimal F-1 score of 0.307.

Figure 6. Mean F-1 scores of the SM-only variant of FSL-Net, where the Prediction Network is omitted and feature shift localization relies solely on thresholding the statistical measures. The F-1 scores are averaged across datasets, manipulation types, and fractions of manipulated features. Statistical measures are reduced to scalar values by taking either the mean or maximum of the normalized squared differences between the reference and the query.

F.6. Qualitative Evaluation of FSL-Net on the MNIST Dataset

Figure 7 provides a qualitative assessment of the performance of FSL-Net on the MNIST dataset under various manipulation types. The first subfigure (Figure 7a) illustrates the ability of FSL-Net to identify manipulated features in sample images. Each row represents a distinct manipulation type applied to modify 5% of the image features (i.e., pixels), with the top row corresponding to E5, the middle row to E8, and the bottom row to E1. Correctly detected manipulations are highlighted in green, while undetected modifications (false negatives) and misclassified pixels (false positives) are colored red and orange, respectively. The results indicate that FSL-Net effectively detects manipulation E1, while its performance is less consistent for manipulations E5 and E8.
The difficulty in detecting E5 stems from its binary rounding of pixel values to 0 or 1, which often blends seamlessly into the digit structure. Similarly, E8, which randomly permutes pixel values, disrupts spatial coherence without necessarily creating visually distinct artifacts, making detection more challenging. The second subfigure (Figure 7b) presents the Mean Squared Error (MSE) between the reference and query statistical functional maps derived from statistical measures (including the mean and standard deviation), the Moment Extraction Network, and the Neural Embedding Network. Each bar represents a distinct feature, with color coding indicating classification outcomes: true positive (green), false negative (red), false positive (orange), and true negative (blue). Manipulation E1 (bottom row) is easily detectable, exhibiting high MSE values, particularly in the statistical functional maps derived from the mean, the Moment Extraction Network, and the Neural Embedding Network. In contrast, more subtle manipulations, such as E5 (top row) and E8 (middle row), display smaller MSE values between reference and query for certain features, making their detection more challenging. Still, several features are correctly classified for these challenging manipulations, with the Neural Embedding Network and the Moment Extraction Network producing the most distinguishable statistical functional maps. The mean and standard deviation measures are ineffective in detecting manipulation E8, which affects only feature correlations and therefore yields low MSE values for these measures.

(a) Comparison of FSL-Net output for four examples from the MNIST dataset, with manipulation type E5 (top row), E8 (middle row), and E1 (bottom row). In each case, 5% of the image features were manipulated.
Green rectangles indicate correctly predicted corrupted features, red rectangles denote false negatives, and orange rectangles represent false positives.

(b) Mean Squared Error (MSE) between reference and query statistical functional maps derived from statistical measures, the Moment Extraction Network, and the Neural Embedding Network. Each bar represents a feature, with colors indicating classification outcomes. The bottom row corresponds to manipulation type E1, while the middle and top rows correspond to the more challenging manipulation types E8 and E5, respectively.

Figure 7. Qualitative evaluation of FSL-Net on the MNIST dataset.
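The per-feature MSE diagnostic of Figure 7b, together with the color-coded outcome labels, can be sketched as below. This is an illustrative numpy sketch: the map shape and function names are assumptions, and the ground-truth corrupted set is taken as a boolean mask.

```python
import numpy as np

def per_feature_mse(map_ref, map_query):
    """MSE between reference and query statistical functional maps of shape
    (d, t), yielding one value per feature (the bar heights in Figure 7b)."""
    return ((map_ref - map_query) ** 2).mean(axis=1)

def classify_outcomes(predicted, corrupted):
    """Label each feature TP / FN / FP / TN from boolean masks of predicted
    and ground-truth corrupted features (the bar colors in Figure 7b)."""
    return np.where(predicted & corrupted, "TP",
           np.where(~predicted & corrupted, "FN",
           np.where(predicted & ~corrupted, "FP", "TN")))
```

High per-feature MSE in any of the three map families signals a likely shifted feature, which is why E1 (large marginal changes) stands out while E8 (correlation-only changes) barely moves the mean and standard deviation maps.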