# datasuite_datacentric_identification_of_indistribution_incongruous_examples__d1dde27d.pdf Data-SUITE: Data-centric identification of in-distribution incongruous examples Nabeel Seedat 1 Jonathan Crabb e 1 Mihaela van der Schaar 1 2 3 Abstract Systematic quantification of data quality is critical for consistent model performance. Prior works have focused on out-of-distribution data. Instead, we tackle an understudied yet equally important problem of characterizing incongruous regions of in-distribution (ID) data, which may arise from feature space heterogeneity. To this end, we propose a paradigm shift with Data-SUITE: a datacentric AI framework to identify these regions, independent of a task-specific model. Data-SUITE leverages copula modeling, representation learning, and conformal prediction to build featurewise confidence interval estimators based on a set of training instances. These estimators can be used to evaluate the congruence of test instances with respect to the training set, to answer two practically useful questions: (1) which test instances will be reliably predicted by a model trained with the training instances? and (2) can we identify incongruous regions of the feature space so that data owners understand the data s limitations or guide future data collection? We empirically validate Data-SUITE s performance and coverage guarantees and demonstrate on cross-site medical data, biased data, and data with concept drift, that Data-SUITE best identifies ID regions where a downstream model may be reliable (independent of said model). We also illustrate how these identified regions can provide insights into datasets and highlight their limitations. 1. Introduction Machine learning models have a well-known reliance on training data quality (Park et al., 2021). Hence, when deploying such models in the real world, the reliability of 1Department of Applied Mathematics and Theoretical Physics, University of Cambridge, UK 2The Alan Turing Institute, London, UK 3University of California, Los Angeles, USA. Correspondence to: Nabeel Seedat . Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). predictions depends on the data s congruence with respect to the training data. Significant literature has focused on identifying data instances that lie out of the training data s distribution (OOD). This includes label shifts (Ren et al., 2018; Hsu et al., 2020) or input feature shift, where these instances fall out of the support of the training set s distribution (Zhang et al., 2021). However, a much less studied, yet equally important problem is identifying heterogeneous regions of in-distribution (ID) data. Data in the wild can be ID yet have heterogeneous regions in feature space. This manifests in varying levels of incongruence, in cases of different sub-populations, data biases or temporal changes (Leslie et al., 2021; Gianfrancesco et al., 2018; Obermeyer et al., 2019). We illustrate each of these types of incongruence with real world data (Table 2), in the experiments from Secs. 4.3 and 4.5. In this paper, we present a data-centric framework to characterize such incongruous regions of ID data and define two groups, namely (i) inconsistent and (ii) uncertain, with respect to the training distribution. We contextualize the difference based on confidence intervals (CI) (See Sec.3.3 for details). 
When feature values lie outside of a CI, we term it inconsistent, alternatively we characterize the level of feature uncertainty based on the CI s width. At this point, one might ask if the data is ID; why should we worry? Not accounting for these incongruous ID regions of the feature space can be problematic when deploying models in high-stakes settings such as healthcare, where spurious model predictions can be deadly (Saria & Subbaswamy, 2019; Varshney, 2020). That said, even in settings where poor predictions are not risky, consistent exploratory data analysis (EDA) and retroactive auditing of such data is timeconsuming for data scientists (Polyzotis et al., 2017; Kandel et al., 2012). Hence, systematically identifying these incongruous regions has immense practical value. Consequently, we build a framework to empower data scientists to address the previously mentioned challenges related to insightful exploratory data analysis (EDA) and reliable model deployment, anchored by the following desiderata: (D1) Insightful Data Exploration: Alice has a new dataset D and wants to explore and gain insights into it with respect Data-SUITE: Data-centric identification of in-distribution incongruous examples Feature 1 Feature 2 Feature 1 Feature 2 CERTAIN INSTANCES (Small CIs) UNCERTAIN INSTANCES (Large CIs) INCONSISTENT INSTANCES (Outside CI) Feature 1 Feature 2 Sample 2 Out-of-distribution (OOD) In-distribution (ID) (a) ALICE (D1): Data Exploration (b) BOB (D2): Model Deployment Model-Centric Uncertainty (Model Dependent) Data-Centric Uncertainty (Model Independent) Predicted Value + Predictive uncertainty Figure 1. Illustration highlighting two problems Data-SUITE addresses to a training set, without necessarily training a model. It would be useful if, independent of a predictive model, she could both identify the incongruous regions of the feature space (e.g., sub-population bias or under-representation), as well as, obtain easily digestible prototype examples of each region. This could guide where to collect more data and, if this is not possible, to understand the data s limitations. (D2) Reliable Model Deployment: Bob has a trained model f and now deploys it to another site. For new data Dtest, it would be useful if he could identify incongruous regions, for which he should NOT trust f to make predictions. (D3) Practitioner confidence: Both Alice and Bob want to feel confident when using any tool. Guarantees of coverage of predictive intervals (e.g. CIs) could assist in this regard. These examples, shown in Fig. 1 highlight the need to understand incongruence in data. As we shall discuss in the related work, there has been significant work on uncertainty estimation, with a focus on the uncertainty of a model s predictions (model-centric). Estimating predictive uncertainty can address Bob s use-case (D2), however since it requires a predictive model it is not naturally suited to Alice s insights use-case (D1). Further, most predictive uncertainty methods do not provide coverage guarantees (D3). Therefore, in satisfying all the desiderata, we take a different approach and advocate for a data-centric approach, where we model the uncertainty in the data features 1. This is different from model-centric predictive uncertainty, as we construct CIs (at feature level), without reference to any downstream model. A benefit of the flexibility is that we can flag instances and draw insights that are not model-specific (i.e. model independent). 
To ensure clarity, we note that the term data-centric is used in the context of data-centric AI, which we define as tools applied to the underlying data used to train and evaluate models . We note that both paradigms of ML (model and data-centric) rely on models, differing based on how mod- 1Here, feature uncertainty refers to the degree of incongruity with the training distribution, rather than the uncertainty of the measured value (e.g. measurement noise) els are used. In model-centric ML : models are used for predictive tasks and in data-centric ML : models are used to study/evaluate the data itself, which does not preclude using algorithms to process the data. Our definition is consistent with (Polyzotis et al., 2017) data-centric AI as the problem of (...) [A]nd quality monitoring processes for datasets . This paper fits this as a systematic tool for data quality evaluation based on uncertainty. In this work, we focus on tabular data, a common format in medicine, finance, manufacturing etc, where data is based on relational databases (Borisov et al., 2021; Yoon et al., 2020). That said, compared to image data, tabular data has an added challenge since specific features may be uncertain while others are not; hence characterizing an instance as a whole is non-trivial. Contributions. We present Data Searching for Uncertain and Inconsistent Test Examples (Data-SUITE), a datacentric framework to identify incongruous regions of data using CI s and make the following contributions: Data-SUITE is a paradigm shift from model-centric uncertainty and, to the best of our knowledge, the first to characterize ID regions in a systematic data-centric, modelindependent manner. Not only is this more flexible, but also enables us to gain insights which are not model-specific. Data-SUITE s pipeline-based approach to construct feature-wise CIs enables specific properties (Sec. 3.2) that permit us to flag uncertain and inconsistent instances, making it possible to identify incongruous data regions. Data-SUITE s performance and properties, such as coverage guarantees, are validated to satisfy D3 (Sec. 4.1). Further motivating the paradigm shift, we empirically highlight the performance benefit of a data-centric approach compared to a model-centric approach (Sec. 4.2). As a portrayal of reliable model deployment (D2), we show on real-world datasets with different types of incongruence, that Data-SUITE best identifies incongruous data regions, translating to the best performance improvement. (Sec. 4.3). Finally, we illustrate with multiple use-cases how Data SUITE can be used as a model-independent tool to facilitate insightful data exploration, hence satisfying D1 (Sec. 4.5). Data-SUITE: Data-centric identification of in-distribution incongruous examples 2. Related work This paper primarily engages with the literature on uncertainty quantification and contributes to the nascent area of data-centric AI. We also highlight the key differences of our work with the literature on noisy labels. Uncertainty quantification. There are numerous Bayesian and non-Bayesian methods for uncertainty quantification, including Gaussian processes (Williams & Rasmussen, 2006), Quantile Regression (Koenker & Hallock, 2001), Bayesian Neural Networks (Ghosh et al., 2018; Graves, 2011), Deep Ensembles (Lakshminarayanan et al., 2017), Dropout (Gal & Ghahramani, 2016; Chan et al., 2020) and Conformal Prediction (Vovk et al., 2005). 
These methods typically assess predictive uncertainty, i.e., measuring the certainty in the model s prediction (Seedat & Kanan, 2019). The predominant focus on predictive uncertainty is different from the notion of uncertainty in our setting, which is feature (i.e. data) uncertainty. We specifically highlight that we quantify data uncertainty, independent of a task-specific model. Additionally, the aforementioned methods often do not assess the coverage or provide guarantees of the uncertainty interval (Wasserman, 2004; Alaa & Van Der Schaar, 2020) (i.e., how often the interval contains the true value). The concept of coverage will be outlined further in Secs. 3 and 4. Data-Centric AI. Ensuring high data quality is a critical but often overlooked problem in ML, where the focus is optimizing models (Sambasivan et al., 2021; Jain et al., 2020). Even when it is considered, the process of assessing datasets is adhoc or artisanal (Sambasivan et al., 2021; Ng et al., 2021). However, there has been recent discussion around data-centric AI (DCAI), which we define as tools applied to the underlying data used to train and evaluate models, independent of the task-specific, predictive models. Our work contributes to this nascent body of work presenting Data-SUITE, which, to the best of our knowledge, is the first systematic data-centric framework to model uncertainty in datasets. Specifically, we model the uncertainty in the feature (data) values themselves (data-centric), which contrasts to modeling the uncertainty in predictions (model-centric). We also highlight a tangential of classical data management (Kumar et al., 2017), which does not consider uncertainty and randomness in the data for subsequent analysis. An exception is probabilistic databases (Suciu et al., 2011). Noisy labels. Learning with noisy data is a widely studied problem, we refer the reader to (Algan & Ulusoy, 2021; Song et al., 2020) for an in depth review. In machine learning, the focus is label noise. We argue that work on noisy labels is not directly related, as the goal is to learn a model robust to the label noise, which is different from our goal of modeling the uncertainty in the features. Additionally, methods are often coupled to the task-specific predictive model, which is different from our model-independent setting. 3. Data-SUITE In this section, we give a detailed formulation of Data SUITE 2,3. We start with a problem formulation and outline the motivation for working with feature confidence intervals (CIs). Then, we describe how these CIs are built by leveraging copula modelling, representation learning and conformal prediction. Finally, we demonstrate how these CIs permit to flag uncertain and inconsistent instances. 3.1. Preliminaries We consider a feature space X = Qd X i=1[ai, bi] Rd X, where [ai, bi] is the range for feature i. Note that we make the range of each feature explicit, this will be necessary in the definition of our formalism. We assume that we have a set of M N training instances Dtrain = {xm | m [M]} sampled from an unknown distribution P, where [M] denotes the positive integers between 1 and M. These instances typically correspond to training data for a model on a downstream task, such as classification. We assume that we are given new test instances Dtest. Our purpose is to flag the subset of instances from Dtest that are quantitatively different from instances of Dtrain without necessarily being OOD. 
To that aim, we use Dtrain to build CIs [li(x), ri(x)] [ai, bi] for each feature i [d X] of each test instance x Dtest. As we will show in Sec. 3.3, these CIs permit to systematically flag test instances whose features are uncertain or inconsistent with respect to Dtrain. For now, let us motivate the usage of feature CIs: (1) With a model of uncertainty and inconsistency at the feature level, it is possible to identify regions of the feature space X where bias and/or low coverage occurs with the training data Dtrain. (2) Since CIs are built with Dtrain and without reference to any predictive downstream model, the flagged instances in Dtest are likely to be problematic for any downstream model trained on top of Dtrain. Hence, we are able to draw conclusions that are not model-specific. These two points are illustrated in our experiments from Sec. 4. Let us now detail how the CIs are built. 3.2. Feature CIs We now build CIs [li(x), ri(x)] [ai, bi] for each feature i [d X] of each test instance x Dtest. It goes without saying that the CIs should satisfy some properties, i.e. (P1) Coverage: We would like to guarantee that the feature xi of an instance x P lies within the interval such that E 1xi [li(x),ri(x)] 1 α where the significance level α (0, 1) can be chosen. In this way, a feature out of the CI hints that x is unlikely to be sampled from P at the given significance level. This is then considered across all features to characterize the instance (see Sec. 3.3). 2https://github.com/seedatnabeel/Data-SUITE 3https://github.com/vanderschaarlab/mlforhealthlabpub/ tree/main/alg/Data-SUITE Data-SUITE: Data-centric identification of in-distribution incongruous examples Figure 2. Outline of our framework Data-SUITE. (P2) Instance-wise: The CI should be adaptive at an instance level. i.e, we do not wish ri(x) li(x) to be constant w.r.t x X. In this way, the CIs permit to order various test instances x Dtest according to their uncertainty. This property is particularly desirable in healthcare settings where we wish to quantify variable uncertainty for individual patients, rather than for the population as whole. (P3) Feature-wise: We build CIs [li(x), ri(x)] for each feature i [d X] as opposed to an overall confidence region R(x) X. While less general than the latter approach, feature-wise CIs are more interpretable, allowing attribution of inconsistencies and uncertainty to individual features. (P4) Downstream coupling: Instances with smaller CIs are more reliably predicted by a downstream model trained on Dtrain. More precisely, our CIs should have a negative correlation between CI width and downstream model performance. In this way, CIs allow to draw conclusions about the incongruence of test instances x Dtest. To construct feature CIs that satisfy these properties, we introduce a new framework leveraging copula modeling, representation learning and conformal prediction. The blueprint of our method is presented in Fig. 2. Concretely, our method relies on 3 building blocks: a generator that augments the initial training set Dtrain; a representer that leverages the augmented training set D+ train to learn a low-dimensional representation f : X H of the data and a conformal predictor that predicts instance-wise feature CIs [li(x), ri(x)] on the basis of each instance s representation f(x) H. By construction of the CIs, this method fulfills properties (P2) and (P3). As we will see in the following, the conformal predictor s theoretical properties guarantees (P1). 
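Before detailing each block, a minimal composition sketch may help fix the data flow. This is not the authors' implementation: `generator`, `representer` and `conformal` are placeholder objects whose interfaces are assumed here, and concrete versions of each block are sketched in the subsections that follow.

```python
# A minimal composition sketch of the Data-SUITE blueprint; all object
# interfaces (fit_sample, fit_transform, fit, intervals) are illustrative
# assumptions, not the authors' API.
def fit_data_suite(X_train, generator, representer, conformal, n_augment=5000):
    """Fit the three Data-SUITE blocks on the training set."""
    X_aug = generator.fit_sample(X_train, n_augment)   # augmented D+_train (copula samples)
    H_aug = representer.fit_transform(X_aug)           # low-dimensional representation f(x)
    conformal.fit(H_aug, X_aug)                        # per-feature regressors + calibration
    return representer, conformal

def feature_intervals(X_test, representer, conformal):
    """Instance-wise, feature-wise CIs [l_i(x), r_i(x)] for new data."""
    H_test = representer.transform(X_test)
    return conformal.intervals(H_test)
```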
However, we also empirically validate (P1), see Sec 4.1. We then also demonstrate (P4) empirically in Sec. 4. Appendix C.1 quantifies the significance of each block via an ablation study. Let us now detail each block. Generator. The purpose of the generator is to augment the initial training set Dtrain with instances that are consistent with the initial distribution P. Many data augmentation techniques can be used for this block. Since our focus is on tabular data, we found copula modeling to be particularly useful. Copulas leverage Sklar s theorem (Sklar, 1959) to estimate multivariate distributions with univariate marginal distributions. In our case, we use vine copulas (Bedford & Cooke, 2001) to build an estimate ˆP for the distribution P on the basis of Dtrain. We then build an augmented training set D+ train by sampling from the copula density ˆP. Interestingly, our method does not need to access Dtrain once the copula density ˆP is available. It is perfectly possible to use only instances from ˆP to build the augmented dataset D+ train. This could be useful for data sharing, if the access to the training set Dtrain is restricted to the user. Further details and motivations on copulas is found in Appendix A.2.1. Note that a copula might not be ideal for very high-dimensional (large d X) data in domains such as computer vision or genomics. In those cases, copula modeling can be replaced by domain-specific augmentation techniques. Representer. A trivial way to verify the coverage guarantee (P1) would be to use the true values of the features to build the CIs: [li(x), ri(x)] = [xi δ, xi + δ] for some δ R+. The problem with this approach is two-fold: (1) it does not leverage the distribution P underlying the training set Dtrain and (2) it results in an uninformative reconstruction with CIs that does not capture the specificity of each instance, hence contradicting (P2). To provide a more satisfactory solution, we propose to represent the augmented training data D+ train with a representation function f : X H that maps the data into a lower-dimensional latent representation space H Rd H, d H < d X. The purpose of this representer is to capture the structure of the low-dimensional manifold underlying D+ train. At test time, the conformal predictor (detailed next), uses the lower representations f(x) H to estimate a reconstruction interval for each feature xi. This permits to bring a satisfactory solution to the two aforementioned problems: (1) the CIs are reconstructed in terms of latent factors that are useful to describe the training set Dtrain and (2) the predicted CIs vary according to the representation f(x) H of each test instance x Dtest. In essence, our approach is analogous to autoencoders. As we will explain soon, the crucial difference is the decoding step: our method outputs CIs for the reconstructed input. In this work, we use Principal Component Analysis (PCA), the workhorse for tabular data, to learn the representer f. Note that more general encoder architectures can be used in settings such as computer vision. Conformal Predictor. We now turn to the core of the problem: estimating feature-wise CIs. As previously mentioned, the CIs [li(x), ri(x)], i [d X] will be computed on the basis of the latent representation f(x) for each x Dtest. The idea is simple: for each feature i [d X], we train a regressor gi : H [ai, bi] to reconstruct an estimate of the initial features xi from the latent representation f(x) of the associated training instance: (gi f)(x) xi. 
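Before completing the description of the conformal wrapper, the two blocks just introduced can be sketched in a few lines. The sketch below is only illustrative: for brevity it replaces the vine copula with a simple Gaussian-copula stand-in (an assumption, not the authors' exact generator; a dedicated vine-copula library could be substituted), and uses scikit-learn's PCA as the representer. All names and hyperparameters are placeholders.

```python
# Generator + representer sketch: Gaussian-copula stand-in and PCA.
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def fit_sample_gaussian_copula(X: np.ndarray, n_samples: int) -> np.ndarray:
    """Fit a Gaussian copula to X (n, d) and draw synthetic samples for D+_train."""
    n, d = X.shape
    # 1) Probability integral transform: ranks -> pseudo-uniforms -> normal scores.
    u = stats.rankdata(X, axis=0) / (n + 1)
    z = stats.norm.ppf(u)
    # 2) Dependence structure: correlation of the normal scores.
    corr = np.corrcoef(z, rowvar=False)
    # 3) Sample the copula and map back through the empirical marginals.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = stats.norm.cdf(z_new)
    return np.column_stack([np.quantile(X[:, j], u_new[:, j]) for j in range(d)])

def fit_representer(X_aug: np.ndarray, d_h: int = 2) -> PCA:
    """Representer f: X -> H, here PCA as in the paper."""
    return PCA(n_components=d_h).fit(X_aug)

# Usage sketch on toy data
X_train = rng.normal(size=(500, 5))
X_aug = fit_sample_gaussian_copula(X_train, n_samples=2000)
representer = fit_representer(X_aug, d_h=2)
H_aug = representer.transform(X_aug)
```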
Returning to the conformal predictor: we stress that the regressor $g_i$ has no knowledge of the true observed $x_i$ but only of the latent representation $f(x)$, as illustrated in Fig. 3. [Figure 3. Conformal Predictor in Data-SUITE.] Of course, the feature regressors by themselves only provide point-wise estimates for the features. In order to turn these into CIs, we use conformal prediction as a wrapper around the feature regressors (Vovk et al., 2005). We formalize our problem in the framework of Inductive Conformal Prediction (for motivations, see Appendix A.2.2). Under this formulation, we start by splitting the augmented training set into a proper training set and a calibration set: $D^+_{train} = D^+_{train2} \cup D^+_{cal}$. We use the latent representation of the proper training set to train the feature regressors $g_i$, $i \in [d_X]$, for the reconstruction task. Then, the latent representation of the calibration set is used to compute the non-conformity score $\mu$, which estimates how different a new instance looks from other instances. In practice, we use the absolute-error non-conformity score $\mu_i(x) = |x_i - (g_i \circ f)(x)|$. We obtain an empirical distribution of non-conformity scores $\{\mu_i(x) \mid x \in D^+_{cal}\}$ over the calibration instances for each feature $i \in [d_X]$. This is used to obtain the critical non-conformity score $\epsilon$, which corresponds to the $\lceil(|D^+_{cal}| + 1)(1-\alpha)\rceil$-th smallest residual from the set $\{\mu_i(x) \mid x \in D^+_{cal}\}$ (Vovk, 2013). We can then apply the method to any unseen incoming data point to obtain predictive CIs, i.e. $[l_i(x), r_i(x)] = [(g_i \circ f)(x) - \epsilon, (g_i \circ f)(x) + \epsilon]$. However, in this form the CIs are constant across instances, with the width of the interval determined by the residuals of the most difficult instances (largest residuals).

We adapt our conformal prediction framework to obtain the desired adaptive intervals (P2) using a normalized non-conformity function $\gamma$, see Eq. 1 (Boström et al., 2016; Johansson et al., 2015). The numerator is computed as before based on $\mu$; the denominator, however, normalizes per instance. We learn the normalizer per feature $i \in [d_X]$. To do so, we compute the log residuals per feature for all instances in the proper training set $D^+_{train2}$, producing tuples per feature: $\{(f(x), \ln|x_i - (g_i \circ f)(x)|) \mid x \in D^+_{train2}\}$. These are used to train a separate model $\sigma_i : H \to \mathbb{R}_+$ (e.g., an MLP) to predict the log residuals. We can then apply $\sigma_i$ to test instances to capture the difficulty of predicting each instance. Note that we exponentiate the predicted log residual for a test instance, converting it back to the original scale and ensuring positive estimates.

$$\gamma_i(x) \triangleq \frac{|x_i - (g_i \circ f)(x)|}{\sigma_i(x)} \quad (1)$$

We then obtain the critical non-conformity score $\epsilon$ from the empirical distribution of normalized non-conformity scores $\{\gamma_i(x) \mid x \in D^+_{cal}\}$, in the same way as before. The instance-specific adaptive intervals are then obtained as per Eq. 2, where $g_i$ is the underlying feature regressor and $\sigma_i$ is the instance-wise normalizing function.

$$[l_i(x), r_i(x)] = [(g_i \circ f)(x) - \epsilon\,\sigma_i(x),\; (g_i \circ f)(x) + \epsilon\,\sigma_i(x)] \quad (2)$$

Remarks on theoretical guarantees. Under the exchangeability assumption detailed in Appendix A.2.2, the validity of the coverage guarantee (P1) is fulfilled with our definition. In our implementation, we use α = 0.05.
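To make the conformal wrapper concrete, a minimal sketch of the inductive, normalized procedure described above is given below. The regressor and normalizer choices (a random forest for $g_i$, a small MLP for $\sigma_i$) are illustrative assumptions rather than the authors' exact models, and the proper-training/calibration split of the augmented data is taken as given.

```python
# Inductive conformal prediction with the normalized nonconformity score of Eq. (1).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

def conformal_feature_intervals(H_fit, X_fit, H_cal, X_cal, H_test, alpha=0.05):
    """Per-feature CIs [l_i(x), r_i(x)] for every test instance, given the
    latent representations H_* and feature matrices X_* of the proper training
    and calibration sets."""
    n_cal, d = X_cal.shape
    lower = np.empty((H_test.shape[0], d))
    upper = np.empty((H_test.shape[0], d))
    # 0-based index of the ceil((n_cal+1)(1-alpha))-th smallest score.
    k = int(np.ceil((n_cal + 1) * (1 - alpha))) - 1
    for i in range(d):
        # g_i: reconstruct feature i from the latent representation.
        g = RandomForestRegressor(n_estimators=100, random_state=0).fit(H_fit, X_fit[:, i])
        # sigma_i: predict log|residual| (difficulty model) from the representation.
        log_res = np.log(np.abs(X_fit[:, i] - g.predict(H_fit)) + 1e-8)
        sig = MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000,
                           random_state=0).fit(H_fit, log_res)
        # Normalized nonconformity scores on the calibration set (Eq. 1).
        sigma_cal = np.exp(sig.predict(H_cal))
        gamma = np.abs(X_cal[:, i] - g.predict(H_cal)) / sigma_cal
        eps = np.sort(gamma)[min(k, n_cal - 1)]           # critical score
        # Instance-wise adaptive intervals on test data (Eq. 2).
        mid = g.predict(H_test)
        sigma_test = np.exp(sig.predict(H_test))
        lower[:, i] = mid - eps * sigma_test
        upper[:, i] = mid + eps * sigma_test
    return lower, upper
```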
3.3. Identifying Inconsistent and Uncertain Instances

Now that we have CIs $[l_i(x), r_i(x)] \subseteq [a_i, b_i]$ for each feature $x_i$, $i \in [d_X]$, of an instance $x$, we can evaluate whether the features of a test instance fall within their predicted ranges. If a feature falls outside its predicted range, we characterize the inconsistency (see Definition 3.1).

Definition 3.1 (Inconsistency). Let $x \in D_{test}$ be a test instance for which we construct a $(1-\alpha)$-CI $[l_i(x), r_i(x)]$ for each feature $x_i$, $i \in [d_X]$, for some predetermined $\alpha \in (0, 1)$. For each $x_i$, $i \in [d_X]$, the feature inconsistency is a binary variable indicating whether $x_i$ falls out of the CI:

$$\nu_i(x) \triangleq \mathbf{1}\left(x_i \notin [l_i(x), r_i(x)]\right) \quad (3)$$

The instance inconsistency $\nu(x)$ is obtained by averaging over the feature inconsistencies $\nu_i(x)$. The instance $x$ is inconsistent if the fraction of inconsistent features is above a predetermined threshold $\lambda \in [0, 1]$: $\nu(x) > \lambda$ (in our implementation, we use $\lambda = 0.5$).

There can also be degrees of uncertainty for features that fall within the CI, which can reflect on the instance as a whole. Indeed, if the CI $[l_i(x), r_i(x)]$ is large, the feature $x_i$ is likely to fall within its range. Nonetheless, we should keep in mind that large CIs correspond to a large uncertainty for the related feature. This will also typically happen when the instance $x \in D_{test}$ differs from the training set $D_{train}$ used to build the CI. We now introduce a quantitative measure that expresses the degree of uncertainty of the instance with respect to $D_{train}$ (see Definition 3.2).

Definition 3.2 (Uncertainty). Let $x \in D_{test}$ be a test instance for which we construct a $(1-\alpha)$-CI $[l_i(x), r_i(x)]$ for each feature $x_i$, $i \in [d_X]$, for a predetermined $\alpha \in (0, 1)$. For each $x_i$, $i \in [d_X]$, we define the feature uncertainty $\Delta_i(x)$ as the feature CI width normalized by the feature range:

$$\Delta_i(x) \triangleq \frac{r_i(x) - l_i(x)}{b_i - a_i} \quad (4)$$

The instance uncertainty $\Delta(x)$ is obtained by averaging over all feature uncertainties:

$$\Delta(x) \triangleq \frac{1}{d_X} \sum_{i=1}^{d_X} \Delta_i(x) \in (0, 1].$$

Remark. Instance uncertainties are strictly larger than zero, since the feature CI widths are positive and the average is taken over all features. Hence, this characterization offers a natural split between certain and uncertain instances if we sort the instances based on uncertainty.
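Definitions 3.1 and 3.2 reduce to a few lines of array arithmetic once the intervals are available. A minimal sketch follows, using the paper's default threshold λ = 0.5 and assuming the feature ranges $b_i - a_i$ are supplied as an array.

```python
# Flagging inconsistent and uncertain instances from the feature-wise CIs.
import numpy as np

def flag_instances(X_test, lower, upper, feature_ranges, lam=0.5):
    """Return per-instance inconsistency flags and uncertainty scores."""
    # nu_i(x): 1 if feature i falls outside its CI (Eq. 3).
    nu = ((X_test < lower) | (X_test > upper)).astype(float)
    inconsistent = nu.mean(axis=1) > lam                  # instance inconsistency nu(x) > lambda
    # Delta_i(x): CI width normalized by the feature range (Eq. 4).
    widths = (upper - lower) / feature_ranges
    uncertainty = widths.mean(axis=1)                     # instance uncertainty Delta(x)
    return inconsistent, uncertainty

# Sorting by `uncertainty` splits the test instances into certain / uncertain groups.
```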
4. Experiments

This section presents a detailed empirical evaluation demonstrating that Data-SUITE satisfies (D1) Insightful Data Exploration, (D2) Reliable Model Deployment and (D3) Practitioner Confidence, introduced in Sec. 1. We tackle these in reverse order, as practitioner confidence is a prerequisite for the adoption of D1 and D2. Recall that the notion of uncertainty in Data-SUITE is different from predictive uncertainty (model-centric). We empirically compare these two paradigms using methods for predictive uncertainty. That said, a natural additional question is whether model-centric uncertainty estimation methods can simply be applied in this setting to provide uncertainty estimates for feature values. We benchmark the following widely used Bayesian and non-Bayesian methods (under BOTH the model-centric and data-centric paradigms): Bayesian Neural Networks (BNN) (Ghosh et al., 2018), Deep Ensembles (ENS) (Lakshminarayanan et al., 2017), Gaussian Processes (GP) (Williams & Rasmussen, 2006), Monte-Carlo Dropout (MCD) (Gal & Ghahramani, 2016) and Quantile Regression (QR) (Koenker & Hallock, 2001). We also ablate and test Data-SUITE's constituent components independently: conditional sampling from the copula (COP) and conformal prediction on raw data (CONF) (Vovk et al., 2005; Balasubramanian et al., 2014). For implementation details, see Appendix B.1.

4.1. Validating coverage & comparing properties

We first wish to validate the CIs to ensure that the coverage guarantees are satisfied, such that users can have confidence that the true value lies within the predicted CIs (D3). We assess the CIs based on the following metrics defined in (Navratil et al., 2020): (1) Coverage: how often the CI contains the true value, (2) Deficit: the extent of CI shortfall (i.e., the severity of the errors) and (3) Excess: the extent of excess CI width beyond what is needed to capture the true value.

$$\text{Coverage} = \mathbb{E}\left[\mathbf{1}_{x_i \in [l_i, r_i]}\right] \quad (5)$$

$$\text{Deficit} = \mathbb{E}\left[\mathbf{1}_{x_i \notin [l_i, r_i]} \cdot \min\left\{|x_i - l_i|,\, |x_i - r_i|\right\}\right] \quad (6)$$

$$\text{Excess} = \mathbb{E}\left[\mathbf{1}_{x_i \in [l_i, r_i]} \cdot \min\left\{x_i - l_i,\, r_i - x_i\right\}\right] \quad (7)$$
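For concreteness, empirical versions of Eqs. (5)-(7) amount to masked averages over all instances and features; a minimal sketch (variable names illustrative) is:

```python
# Empirical coverage, deficit and excess of feature-wise intervals.
import numpy as np

def interval_metrics(X, lower, upper):
    inside = (X >= lower) & (X <= upper)
    coverage = inside.mean()
    # Deficit: distance to the nearest endpoint when the true value falls outside.
    deficit = np.where(~inside,
                       np.minimum(np.abs(X - lower), np.abs(X - upper)), 0.0).mean()
    # Excess: slack to the nearest endpoint when the true value is covered.
    excess = np.where(inside, np.minimum(X - lower, upper - X), 0.0).mean()
    return coverage, deficit, excess
```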
Synthetic data. We assess the properties of the different methods using synthetic data, as ground-truth values are available even when encoding incongruence. The synthetic data, with features $X = [X_1, X_2, X_3]$, is drawn IID from a multivariate Gaussian distribution parameterized by a mean vector $\mu$ and a positive definite covariance matrix $P$ (details in Appendix B.2). We sample $n = 1000$ points for both $D^{synth}_{train}$ and $D^{synth}_{test}$ and encode incongruence into $D^{synth}_{test}$ using a multivariate additive model $\hat{X} = X + Z$, where $Z \in \mathbb{R}^{n \times m}$ is the perturbation matrix. We conduct experiments with different configurations: (1) $D_a$: multivariate Gaussian with variance 2 and a varying proportion of perturbed instances; (2) $D_b$: multivariate Gaussian with varying variance and a fixed proportion of perturbed instances (50%); and (3) $D_c$: varying distributions {Beta, Gamma, Normal, Weibull}.

Fig. 4 outlines the mean coverage, deficit and excess, averaged over five runs, for $D^{synth}_{test}$ under the different configurations ($D_a$, $D_b$, $D_c$). [Figure 4. Comparison of methods based on coverage, deficit and excess under various configurations ($D_a$, $D_b$, $D_c$); panels: (i) coverage with the 95% guarantee ($1-\alpha$), (ii) deficit, (iii) excess.] There is clear variability amongst the different methods, suggesting that specific methods are more suitable. Data-SUITE outperforms the other methods on coverage and deficit across all configurations. We note that the methods with poor coverage are typically incorrectly confident, i.e. small intervals with low coverage and high deficit. Fig. 4 also demonstrates a meaningful relationship: coverage and deficit are inversely related (high coverage is associated with low deficit), as are deficit and excess. Although high coverage and low deficit would ideally occur with low excess, we observe that high levels of coverage occur in conjunction with high levels of excess. Critically, however, in satisfying D3, Data-SUITE maintains the 95% coverage guarantee across all configurations, unlike the other methods.

4.2. Synthetic data stratification w/ downstream task

While it is essential to validate a method's properties, the most useful goal is whether the intervals can be used to identify instances that will be reliably predicted by a downstream predictive model. With Data-SUITE, we stress that this is done in a model-independent manner (i.e. no knowledge of the downstream model). We train a downstream regression model using $D^{synth}_{train}$, where features $X_1, X_2$ are used to predict $X_3$. We first compute a baseline mean squared error (MSE) on a held-out validation set of $D^{synth}_{train}$ and on the complete test set $D^{synth}_{test}$ ($\hat{X} = X + Z$). Thereafter, we construct predictive intervals for $D^{synth}_{test}$ using all benchmark methods (either uncertainty intervals or CIs). The intervals are then used to sort instances based on width.

In addition, we answer the question of whether a data-centric or model-centric approach yields the best performance. For the data-centric paradigm, we construct intervals for the features $X_1, X_2$; hence instances are categorized in a model-independent manner based on data-level CIs. In contrast, the model-centric paradigm is tightly coupled with a task-specific model, categorizing instances using predictive uncertainty on the prediction of $X_3$. We then compute the MSE for the 100 most certain instances as ranked by each method (smallest widths). For Data-SUITE, we also compute the MSE for those instances identified as inconsistent (outside CIs). The best method is the one whose certain instances produce MSE values closest to the clean train MSE (baseline), i.e. the lowest MSE.

Table 1 shows the MSE for configurations $D_a$ and $D_b$. As one example of satisfying D2, Data-SUITE has the best performance: it identifies the top 100 certain instances that yield the best downstream model performance, with the lowest MSE across all configurations. In addition, as expected, the inconsistent instances are unreliably predicted. The poor performance of the ablations of Data-SUITE's components suggests the necessity of the inter-connected framework (more in Appendix C.1).

Table 1. MSE based on instance stratification for different methods. Data-SUITE outperforms other methods, whilst data-centric methods in general outperform model-centric methods.

| Method | Proportion ($D_a$): .1 | Proportion ($D_a$): .25 | Proportion ($D_a$): .5 | Variance ($D_b$): 1 | Variance ($D_b$): 2 |
|---|---|---|---|---|---|
| Train Data (baseline) | .067 | .059 | .068 | .065 | .068 |
| Test Data | .222 | .513 | .889 | .275 | .889 |
| *Data-centric* | | | | | |
| Data-SUITE (All, Uncertainty) | .069 | .122 | .197 | .104 | .197 |
| Data-SUITE (All, Inconsistent) | .595 | 1.608 | 2.322 | .791 | 2.322 |
| Data-SUITE (CONF) | .125 | .396 | .846 | .293 | .846 |
| Data-SUITE (COP) | .220 | .277 | .451 | .236 | .451 |
| BNN | .192 | .216 | .704 | .173 | .704 |
| ENS | .125 | .311 | .565 | .204 | .565 |
| GP | .112 | .153 | .296 | .158 | .296 |
| MCD | .173 | .391 | .692 | .201 | .692 |
| QR | .116 | .228 | .635 | .193 | .635 |
| *Model-centric* | | | | | |
| BNN (Predictive) | .208 | .220 | .692 | .195 | .692 |
| ENS (Predictive) | .143 | .226 | .625 | .257 | .625 |
| GP (Predictive) | .147 | .472 | .584 | .237 | .584 |
| MCD (Predictive) | .206 | .255 | .684 | .213 | .684 |
| QR (Predictive) | .187 | .477 | .671 | .223 | .671 |

Additionally, we see for the same base methods (e.g. BNN, MCD, etc.) that the data-centric paradigm outperforms the model-centric paradigm in identifying the best instances, i.e. those giving the lowest MSE. This result highlights the performance advantage of a flexible, model-independent data-centric paradigm compared to the model-centric paradigm.
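A sketch of this stratified evaluation, assuming a downstream regressor and per-instance interval widths are already available, might look as follows (the choice k = 100 follows the experiment description; everything else is illustrative).

```python
# Evaluate downstream MSE on the k most certain instances (smallest widths).
import numpy as np
from sklearn.metrics import mean_squared_error

def certain_instance_mse(downstream_model, X_test_feats, y_test, widths, k=100):
    """MSE of the downstream model on the k test instances with the smallest
    mean interval width, i.e. the most certain instances."""
    order = np.argsort(widths)          # ascending width: most certain first
    idx = order[:k]
    return mean_squared_error(y_test[idx], downstream_model.predict(X_test_feats[idx]))
```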
4.3. Real dataset stratification w/ downstream task

We now demonstrate how Data-SUITE can be practically used on real data to stratify instances for improved downstream performance (satisfying D2), specifically to assist with more reliable and performant model deployment across a variety of scenarios. To this end, we select three real-world datasets with different types of incongruence, as presented in Table 2. For details see Appendix B.3.

Table 2. Comparison of real-world datasets.

| Dataset | Incongruence Type | Downstream Task | Stratification |
|---|---|---|---|
| Seer (US) & Cutract (UK) | Geographic UK-US (cross-site medical) | Predict mortality from prostate cancer | $D^{Seer}_{train}$: Seer (US); $D^{Cut}_{test}$: Cutract (UK) |
| Adult | Demographic bias (gender & income) | Predict income over $50K | $D^{Adult}_{train}$, $D^{Adult}_{test}$: balanced split |
| Electricity | Temporal (consumption patterns) | Predict electricity price rise/fall | $D^{Elec}_{train}$: 1996; $D^{Elec}_{test}$: 1997-1998 |

Evaluation. We stratify $D_{test}$ into certain and uncertain instances based on the interval width predicted by each method, e.g. the most uncertain instance has the largest width.

Are the identified instances OOD? At this point, one might be tempted to assume that the identified instances are simply OOD, i.e. samples that fall out of the support of the training set's distribution. We show that in reality this is unlikely the case. We apply existing algorithms to detect OOD samples and outliers (for others see (Yang et al., 2021)): Mahalanobis distance (Lee et al., 2018), SUOD (Zhao et al., 2021), COPOD (Li et al., 2020) and Isolation Forest (Liu et al., 2012). For each of the detection methods, we compute the overlap between the predicted OOD/outlier instances and the uncertain and inconsistent instances identified by Data-SUITE. We found minimal overlap across methods, ranging between 1-18%. Additionally, the OOD detection methods were often unconfident in their predictions, with average confidence scores ranging between 5-50%. Both results suggest the identified uncertain and inconsistent instances are unlikely to be OOD. For more, see Appendix C.2.

Each stratification method will identify different instances for each group; hence we aim to quantify which method identifies instances that provide the most improvement to downstream performance. We do this by computing the accuracy of the certain and uncertain stratifications on a downstream random forest trained on $D_{train}$. Ideally, correct instance stratification results in greater accuracy for certain compared to uncertain instances. As an overall comparative metric, we compute the Mean Performance Improvement (MPI, Eq. 8). MPI is the difference in accuracy (Acc) between certain and uncertain instances, as identified by a specific method, averaged over different threshold proportions $P$. The best performing method would clearly identify the most appropriate certain and uncertain instances, which would translate to the largest MPI.

$$\text{MPI} = \frac{1}{|P|} \sum_{p \in P} \left[\text{Acc}(\text{Cert}_p) - \text{Acc}(\text{Uncert}_p)\right] \quad (8)$$

where $P = \{0.05k \mid k \in [20]\}$, $\text{Cert}_p$ = set of the $p$ most certain instances, and $\text{Uncert}_p$ = set of the $p$ most uncertain instances.
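The MPI metric of Eq. (8) is straightforward to compute from any per-instance uncertainty ranking; a sketch for a classification task (model and variable names illustrative) is:

```python
# Mean Performance Improvement (Eq. 8): accuracy gap between the p most
# certain and p most uncertain instances, averaged over p in {0.05, ..., 1.0}.
import numpy as np
from sklearn.metrics import accuracy_score

def mean_performance_improvement(model, X_test, y_test, uncertainty):
    order = np.argsort(uncertainty)      # ascending: most certain first
    n = len(y_test)
    gaps = []
    for k in range(1, 21):
        m = max(1, int(0.05 * k * n))
        cert, unc = order[:m], order[-m:]
        gaps.append(accuracy_score(y_test[cert], model.predict(X_test[cert]))
                    - accuracy_score(y_test[unc], model.predict(X_test[unc])))
    return float(np.mean(gaps))
```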
Fig. 5 illustrates an example of Data-SUITE applied to the CUTRACT dataset. The MPI metric (Eq. 8) is the mean difference between the certain (green) and uncertain (red) curves. The results demonstrate the performance improvement when evaluating on the stratified certain and uncertain instances (compared to performance evaluated on the baseline $D_{test}$ or on a random sample of instances). The result further demonstrates that the identified inconsistent instances have worse performance than the uncertain instances.

Figure 5. Example on CUTRACT of how Data-SUITE instance stratification can be used to improve downstream performance, contrasted with baseline $D_{test}$ (blue) or random selection (black).

Table 3 shows the MPI scores across methods.

Table 3. MPI metric across datasets for different methods.

| Method | SEER-CUTRACT | Adult | Electricity |
|---|---|---|---|
| Data-SUITE | 0.11 ± 0.015 | 0.64 ± 0.03 | 0.26 ± 0.03 |
| BNN | 0.08 ± 0.02 | -0.15 ± 0.02 | -0.005 ± 0.01 |
| CONFORMAL | 0.05 ± 0.01 | -0.07 ± 0.07 | 0.12 ± 0.03 |
| ENSEMBLE | 0.01 ± 0.02 | -0.03 ± 0.02 | -0.02 ± 0.02 |
| GP | 0.05 ± 0.04 | 0.56 ± 0.02 | 0.04 ± 0.04 |
| MCD | 0.01 ± 0.01 | -0.16 ± 0.01 | 0.15 ± 0.03 |
| QR | -0.10 ± 0.03 | 0.12 ± 0.06 | 0.15 ± 0.06 |

In satisfying D2 (improving deployed model performance), Data-SUITE consistently outperforms the other methods, providing the greatest performance improvement with the lowest variability across datasets. The result suggests that Data-SUITE identifies the most appropriate certain and uncertain instances, accounting for the performance improvement. Overall, the quality of stratification by Data-SUITE is not matched by any benchmark uncertainty estimation method.

4.4. Data-SUITE usage with diverse downstream models

Data-SUITE identifies instances independently of the downstream task-specific predictive model. The previous experiments use a fixed downstream model (e.g. RF). We now evaluate a more diverse set of downstream models, i.e. RF, MLP and GBT, and include a robust model, the Median-of-Means (MOM) estimator (Lecué et al., 2020). For each downstream model, we compare the performance on the instances identified by Data-SUITE and by the baselines. We rank the methods based on MPI (see Eq. 8), i.e. a larger MPI means a higher rank. Ideally, the rank should be invariant across models, showing that the instances identified by Data-SUITE are impactful no matter the downstream model.

Figure 6. Rank assessment for a diverse set of downstream models, showing Data-SUITE has the most consistent ordering across different models.

We show results for the Adult dataset in Fig. 6, with the other datasets included in Appendix C.5. Overall, the results are similar: Data-SUITE's identification remains the most appropriate, with the consistently highest MPI rank, no matter the model. In contrast, the baseline methods rank differently depending on the model. This shows that Data-SUITE identifies impactful instances regardless of the downstream task-specific model.

4.5. Use-Case: Data-SUITE in the hands of users

We now demonstrate how users can practically leverage Data-SUITE to better understand their data. We do so by profiling the incongruous regions identified by Data-SUITE and highlighting the insights that can be garnered independent of a model. This satisfies D1, where the quantitative profiling provides valuable insights that could assist data owners in characterizing where to collect more data and, if this is not possible, in understanding the data's limitations. For visual purposes, we embed the identified certain and uncertain instances into a 2-D low-dimensional space, as shown in Fig. 7. We clearly see that the certain and uncertain instances form distinct regions and that they lie in-distribution, as evidenced by the embedding projection. This reinforces the quantitative findings of the previous experiment (i.e., that the identified instances are not OOD). We further highlight centroid (average) prototypes of the certain and uncertain regions as digestible examples of each region, which can easily be understood by stakeholders. For the SEER-CUTRACT analysis, in addition to prototypes for the $D^{Cut}_{test}$ regions, we can also find the nearest-neighbor SEER (USA) prototypes for each instance. Comparing the average and nearest-neighbor prototypes helps to tease out the incongruence between the two geographic sites.
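The profiling itself only needs a centroid per region and, for the cross-site analysis, a nearest training neighbour; a minimal sketch (the use of Euclidean distance is an assumption) is:

```python
# Region prototypes for data exploration (Sec. 4.5).
import numpy as np

def region_prototype(X_region: np.ndarray) -> np.ndarray:
    """Average (centroid) prototype of a certain or uncertain region."""
    return X_region.mean(axis=0)

def nearest_training_prototype(prototype: np.ndarray, X_train: np.ndarray) -> np.ndarray:
    """Closest training instance to a prototype (e.g. the nearest SEER example)."""
    dists = np.linalg.norm(X_train - prototype, axis=1)
    return X_train[np.argmin(dists)]
```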
Overall, Fig. 7 and its panel captions highlight the most valuable insights, quantitatively garnered on the basis of Data-SUITE, across all three datasets. We conduct a more detailed analysis of the regions in Appendix C.3 to outline further potential practitioner usage.

[Figure 7. Insights from the prototypes identified by Data-SUITE; the tables in the figure describe the average prototypes for certain and uncertain instances. Panel captions: (i) SEER-CUTRACT: CUTRACT certain instances are similar to their nearest SEER prototypes, whilst CUTRACT uncertain instances differ from their nearest SEER prototypes (e.g. PSA). (ii) Adult: the certain and uncertain instances represent two different demographics, aligning with the known dataset biases toward females; the uncertain instances specifically highlight a sub-group of Black females. (iii) Electricity: the certain instances are similar to the training set in both features and time, whereas the uncertain instances represent a later time period, wherein concept drift has likely occurred.]

5. Discussion

Automation should not replace the expertise and judgment of a data scientist in understanding the data, nor will it replace the ingenuity required to build better models. In this spirit, we developed Data-SUITE and illustrated its capability, across multiple datasets, to empower data scientists to perform more insightful data exploration, as well as enable more reliable model deployment. We address these use-cases for the understudied problem of in-distribution heterogeneity and propose a flexible data-centric solution. This permits the identification of impactful instances, independent of a task-specific model. Data-SUITE stratifies test data into inconsistent and uncertain instances with respect to the training data. This stratification has been shown to be in line with downstream performance and to provide valuable insights for profiling incongruent test instances in a rigorous and quantitative way. The quality of this stratification by Data-SUITE is not matched by any benchmark uncertainty estimation method (data-centric or not). This promising result serves two roles. 1. The method could be used by practitioners both to improve data exploration and to enable more reliable model deployment. 2. Data-SUITE opens up future avenues to advance the data-centric AI research agenda, taking it a step further by explaining why instances might be classed as uncertain or inconsistent, and by exploring how this information could be leveraged either to correct the identified instances or to improve the data collection process, thereby improving overall data quality. Finally, the current formulation has focused on tabular data. For usage with high-dimensional data, we refer the reader to Appendix A.2.3 for proposals on possible modifications to Data-SUITE.
Acknowledgements The authors are grateful to Fergus Imrie, Zhaozhi Qian, Yuchao Qin, Krzysztof Kacprzyk, Kamile Stankeviciute and the 4 anonymous ICML reviewers for their useful comments & feedback. Nabeel Seedat would like to acknowledge Hameeda Saif for her constant support and feedback. Nabeel Seedat is funded by the Cystic Fibrosis Trust, Jonathan Crabb e by Aviva and Mihaela van der Schaar by the Office of Naval Research (ONR), NSF 1722516. Data-SUITE: Data-centric identification of in-distribution incongruous examples Alaa, A. and Van Der Schaar, M. Discriminative jackknife: Quantifying uncertainty in deep learning via higher-order influence functions. In International Conference on Machine Learning, pp. 165 174. PMLR, 2020. Algan, G. and Ulusoy, I. Image classification with deep learning in the presence of noisy labels: A survey. Knowledge-Based Systems, 215:106771, 2021. Asuncion, A. and Newman, D. Uci machine learning repository, 2007. Balasubramanian, V., Ho, S.-S., and Vovk, V. Conformal prediction for reliable machine learning: theory, adaptations and applications. Newnes, 2014. Bedford, T. and Cooke, R. M. Probability density decomposition for conditionally dependent random variables modeled by vines. Annals of Mathematics and Artificial intelligence, 32(1):245 268, 2001. Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M., and Kasneci, G. Deep neural networks and tabular data: A survey. ar Xiv preprint ar Xiv:2110.01889, 2021. Bostr om, H., Linusson, H., L ofstr om, T., and Johansson, U. Evaluation of a variance-based nonconformity measure for regression forests. In Symposium on Conformal and Probabilistic Prediction with Applications, pp. 75 89. Springer, 2016. Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. Lof: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pp. 93 104, 2000. Chan, A., Alaa, A., Qian, Z., and Van Der Schaar, M. Unlabelled data improves bayesian uncertainty calibration under covariate shift. In International Conference on Machine Learning, pp. 1392 1402. PMLR, 2020. Duggan, M. A., Anderson, W. F., Altekruse, S., Penberthy, L., and Sherman, M. E. The surveillance, epidemiology and end results (seer) program and pathology: towards strengthening the critical relationship. The American journal of surgical pathology, 40(12):e94, 2016. Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050 1059. PMLR, 2016. Ghosh, S., Yao, J., and Doshi-Velez, F. Structured variational learning of bayesian neural networks with horseshoe priors. In International Conference on Machine Learning, pp. 1744 1753. PMLR, 2018. Gianfrancesco, M. A., Tamang, S., Yazdany, J., and Schmajuk, G. Potential biases in machine learning algorithms using electronic health record data. JAMA internal medicine, 178(11):1544 1547, 2018. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 2014, volume 27, 2014a. Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014b. Graves, A. Practical variational inference for neural networks. Advances in neural information processing systems, 24, 2011. Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch olkopf, B., and Smola, A. 
A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of wasserstein gans. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5769 5779, 2017. Gurumoorthy, K. S., Dhurandhar, A., Cecchi, G., and Aggarwal, C. Efficient data representation by selecting prototypes with importance weights. In 2019 IEEE International Conference on Data Mining (ICDM), pp. 260 269. IEEE, 2019. Haeusser, P., Frerix, T., Mordvintsev, A., and Cremers, D. Associative domain adaptation. In Proceedings of the IEEE international conference on computer vision, pp. 2765 2773, 2017. Harries, M. and Wales, N. S. Splice-2 comparative evaluation: Electricity pricing. 1999. Hsu, Y.-C., Shen, Y., Jin, H., and Kira, Z. Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10951 10960, 2020. Jain, A., Patel, H., Nagalapatti, L., Gupta, N., Mehta, S., Guttula, S., Mujumdar, S., Afzal, S., Sharma Mittal, R., and Munigala, V. Overview and importance of data quality for machine learning tasks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3561 3562, 2020. Joe, H. Dependence modeling with copulas. CRC press, 2014. Johansson, U., S onstr od, C., and Linusson, H. Efficient conformal regressors using bagged neural nets. In 2015 Inter- Data-SUITE: Data-centric identification of in-distribution incongruous examples national Joint Conference on Neural Networks (IJCNN), pp. 1 8, 2015. doi: 10.1109/IJCNN.2015.7280763. Kandel, S., Paepcke, A., Hellerstein, J. M., and Heer, J. Enterprise data analysis and visualization: An interview study. IEEE Transactions on Visualization and Computer Graphics, 18(12):2917 2926, 2012. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, 2014. Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. Advances in neural information processing systems, 28:2575 2583, 2015. Koenker, R. and Hallock, K. F. Quantile regression. Journal of economic perspectives, 15(4):143 156, 2001. Kumagai, A. and Iwata, T. Unsupervised domain adaptation by matching distributions based on the maximum mean discrepancy via unilateral transformations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4106 4113, 2019. Kumar, A., Boehm, M., and Yang, J. Data management in machine learning: Challenges, techniques, and systems. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1717 1722, 2017. Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017. Lecu e, G., Lerasle, M., and Mathieu, T. Robust classification via mom minimization. Machine Learning, 109(8): 1635 1665, 2020. Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018. Lei, J., G Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094 1111, 2018. 
Leslie, D., Mazumder, A., Peppin, A., Wolters, M. K., and Hagerty, A. Does ai stand for augmenting inequality in the era of covid-19 healthcare? BMJ, 372, 2021. Li, Z., Zhao, Y., Botta, N., Ionescu, C., and Hu, X. Copod: copula-based outlier detection. In 2020 IEEE International Conference on Data Mining (ICDM), pp. 1118 1123. IEEE, 2020. Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1):1 39, 2012. Long, M., Cao, Y., Wang, J., and Jordan, M. Learning transferable features with deep adaptation networks. In International conference on machine learning, pp. 97 105. PMLR, 2015. Navratil, J., Arnold, M., and Elder, B. Uncertainty prediction for deep sequential regression using meta models. ar Xiv preprint ar Xiv:2007.01350, 2020. Ng, A., Aroyo, L., Coleman, C., Diamos, G., Reddi, V. J., Vanschoren, J., Wu, C.-J., and Zhou, S. Neurips data-centric ai workshop, 2021. URL https:// datacentricai.org/. Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447 453, 2019. Park, C., Awadalla, A., Kohno, T., and Patel, S. Reliable and trustworthy machine learning for health using dataset shift detection. Advances in Neural Information Processing Systems, 34, 2021. Polyzotis, N., Roy, S., Whang, S. E., and Zinkevich, M. Data management challenges in production machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1723 1726, 2017. Prostate Cancer UK, C. Prostate cancer uk. URL https: //prostatecanceruk.org/. Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334 4343. PMLR, 2018. Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International conference on machine learning, pp. 1530 1538. PMLR, 2015. Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P. K., and Aroyo, L. M. everyone wants to do the model work, not the data work : Data cascades in high-stakes ai. 2021. Saria, S. and Subbaswamy, A. Tutorial: safe and reliable machine learning. ACM Conference on Fairness, Accountability, and Transparency, 2019. Seedat, N. and Kanan, C. Towards calibrated and scalable uncertainty representations for neural networks. ar Xiv preprint ar Xiv:1911.00104, 2019. Shafer, G. and Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008. Data-SUITE: Data-centric identification of in-distribution incongruous examples Sklar, A. Fonctions de r epartition a n dimensions et leurs marges. Publications de l Institut de Statistique de l Universit e de Paris, 8:229 231, 1959. Song, H., Kim, M., Park, D., Shin, Y., and Lee, J.-G. Learning from noisy labels with deep neural networks: A survey. ar Xiv preprint ar Xiv:2007.08199, 2020. Srivastava, A., Valkov, L., Russell, C., Gutmann, M. U., and Sutton, C. Veegan: Reducing mode collapse in gans using implicit variational learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3310 3320, 2017. Suciu, D., Olteanu, D., R e, C., and Koch, C. Probabilistic databases. Synthesis lectures on data management, 3(2): 1 180, 2011. Varshney, K. R. On mismatched detection and safe, trustworthy machine learning. In 2020 54th Annual Conference on Information Sciences and Systems (CISS), pp. 1 4. IEEE, 2020. Vovk, V. 
Transductive conformal predictors. In IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 348 360. Springer, 2013. Vovk, V., Gammerman, A., and Shafer, G. Conformal prediction. Algorithmic learning in a random world, pp. 17 51, 2005. Wasserman, L. All of statistics: a concise course in statistical inference, volume 26. Springer, 2004. Williams, C. K. and Rasmussen, C. E. Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006. Yan, H., Ding, Y., Li, P., Wang, Q., Xu, Y., and Zuo, W. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 945 954, 2017. doi: 10.1109/CVPR.2017.107. Yang, J., Zhou, K., Li, Y., and Liu, Z. Generalized out-of-distribution detection: A survey. ar Xiv preprint ar Xiv:2110.11334, 2021. Yoon, J., Zhang, Y., Jordon, J., and van der Schaar, M. Vime: Extending the success of self-and semi-supervised learning to tabular domain. Advances in Neural Information Processing Systems, 33, 2020. Zhang, L., Goldstein, M., and Ranganath, R. Understanding failures in out-of-distribution detection with deep generative models. In International Conference on Machine Learning, pp. 12427 12436. PMLR, 2021. Zhao, Y., Hu, X., Cheng, C., Wang, C., Wan, C., Wang, W., Yang, J., Bai, H., Li, Z., Xiao, C., et al. Suod: Accelerating large-scale unsupervised heterogeneous outlier detection. Proceedings of Machine Learning and Systems, 3, 2021. Zliobaite, I. How good is the electricity benchmark for evaluating concept drift adaptation. ar Xiv preprint ar Xiv:1301.3524, 2013. Data-SUITE: Data-centric identification of in-distribution incongruous examples A. Data-SUITE details & related work A.1. Extended Related Work We present a comparison of our framework Data-SUITE, and contrast it to the related work of uncertainty quantification and learning with noisy labels. Table 4, highlights both similarities and differences across multiple dimensions. We highlight 3 key features which distinguish Data-SUITE: (1) Data-centric uncertainty is a novel paradigm compared to the predominant model-centric approaches, (2) Our method offers increased flexibility, as it is used independent of task-specific predictive models. Any conclusions that we draw from Data-SUITE are not model-specific. (3) Our method provides theoretical guarantees concerning the validity of coverage. Table 4. Comparison of related work Data-centric uncertainty Model-centric uncertainty Task Model independent No noise assumptions Coverage guarantees Data-SUITE (Ours) Uncertainty Quantification Noisy labels A.2. Data-SUITE Details We present a block diagram of our framework Data-SUITE in Figure 8. We next have in-depth discussions on both the generator and conformal predictor. We outline the motivations as well as technical details not covered in the main paper. Conformal Estimator (Wrapper) Underlying feature regressor ~ Pseudo samples Representer: Learn lowdimensional representation Generator (e.g. Vine Copula) Conformal Estimator (Wrapper) Underlying feature regressor Representer: Compute low-dim representation Instance-wise confidence intervals Feature 1 Feature 2 Feature 1 Feature 2 Feature 1 Feature 2 CERTAIN INSTANCES UNCERTAIN INSTANCES INCONSISTENT INSTANCES Section 3.2.1 Section 3.2.2 Section 3.2.3 Section 3.2.3 Section 3.2.2 Section 3.3 Figure 8. Outline of our framework Data-SUITE A.2.1. GENERATOR: COPULAS Motivation. 
Recall that in our formulation we have a set of training instances $D_{train}$ and we wish to learn the dependency between the features $X$. Hence, we model the multivariate joint distribution of $D_{train}$ to capture the dependence structure between random variables. A significant challenge is modeling distributions in high dimensions. Parametric approaches to density estimation, often based on Gaussian distributions, are largely inflexible, while nonparametric approaches such as kernel density estimation (KDE) are typically infeasible for complex data distributions because of the curse of dimensionality. Additionally, while Variational Autoencoders (VAEs) (Kingma & Welling, 2014) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014a) can learn the joint distribution, both have limitations in our setting: VAEs make strong distributional assumptions (Rezende & Mohamed, 2015), while GANs involve training multiple models, with the associated difficulties (Srivastava et al., 2017; Gulrajani et al., 2017), both the computational burden of training and issues inherent to GANs, e.g., mode collapse. An attractive approach, particularly for tabular data, is copulas, which flexibly couple the marginal distributions of the different random variables into a joint distribution. One important reason lies in the following theorem:

Property A.1 (Sklar's theorem). A $d$-dimensional random vector $X = (X_1, \dots, X_d)$ with joint distribution $F$ and marginal distributions $F_1, \dots, F_d$ can be expressed as $F(x_1, \dots, x_d) = C_\theta\{F_1(x_1), \dots, F_d(x_d)\}$, where $C_\theta : [0, 1]^d \to [0, 1]$ is a copula.

This is attractive in high dimensions, as it separates the learning of the univariate marginal distributions from the learning of the coupled multivariate dependence structure. Parametric copulas have limited expressivity; hence we use pair-copula constructions (vine copulas) (Bedford & Cooke, 2001), which are hierarchical and express a multivariate copula as a cascade of bivariate copulas.

Copula Details. To learn the copula, we factorize the $d$-dimensional copula density into a product of $d(d-1)/2$ bivariate conditional densities by means of the chain rule. The resulting graphical model has an edge for each bivariate copula, which encodes the (conditional) dependence between a pair of variables. The graphical model consists of levels (as many levels as there are features in the dataset); each node is a variable, and edges couple variables via bivariate copulas. As each level is constructed, the number of nodes decreases per level. The product over all pair-copula densities then defines the joint copula density. Once we have learned the copula density, we sample from it to obtain an augmented dataset of pseudo/synthetic samples. The copula samples are then easily transformed back to the natural data scale using the inverse probability integral transform. Specifically, assume we have $U = (U_1, U_2, \dots, U_d)$ with each $U_i \sim \mathcal{U}(0, 1)$. We can then use the copula $C_\theta$ to define variables $S = (S_1, S_2, \dots, S_d)$, where $S_1 = C^{-1}(U_1)$, $S_2 = C^{-1}(U_2 \mid U_1)$, ..., $S_d = C^{-1}(U_d \mid U_1, U_2, \dots, U_{d-1})$. This means that $S$ is the inverse Rosenblatt transform of $U$ and hence $S \sim C$, which allows us to simulate synthetic/pseudo samples. For more on copulas in general, we refer the reader to (Joe, 2014).
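For illustration, the following is a minimal sketch of this generate-and-transform idea using a Gaussian copula in place of a vine copula (the vine construction described above would typically come from a dedicated copula library); the function and variable names are hypothetical, not the paper's code.

```python
import numpy as np
from scipy import stats

def fit_gaussian_copula(X):
    """Fit empirical marginals and couple them with a Gaussian copula
    (a simplified stand-in for the vine copula generator)."""
    n, d = X.shape
    # Probability integral transform: map each feature to (0, 1) via its empirical CDF.
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    U = ranks / (n + 1)
    # Gaussian copula parameter: correlation of the normal scores.
    Z = stats.norm.ppf(U)
    corr = np.corrcoef(Z, rowvar=False)
    return {"X_sorted": np.sort(X, axis=0), "corr": corr}

def sample_gaussian_copula(model, n_samples, rng):
    """Draw coupled uniforms U ~ C, then invert the empirical marginals
    (inverse probability integral transform) back to the data scale."""
    d = model["corr"].shape[0]
    Z = rng.multivariate_normal(np.zeros(d), model["corr"], size=n_samples)
    U = stats.norm.cdf(Z)
    n = model["X_sorted"].shape[0]
    idx = np.clip((U * n).astype(int), 0, n - 1)
    return np.take_along_axis(model["X_sorted"], idx, axis=0)

# Usage: augment the training set with |D_train| pseudo samples.
rng = np.random.default_rng(0)
X_train = rng.multivariate_normal([0.0, 5.0], [[1.0, 0.6], [0.6, 2.0]], size=500)
copula = fit_gaussian_copula(X_train)
X_pseudo = sample_gaussian_copula(copula, n_samples=len(X_train), rng=rng)
```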
Complexity. As we go through each tree, the number of pair copulas decreases, i.e., $T_1 = d-1$, $T_2 = d-2$, ..., $T_{d-1} = 1$. Hence, the complexity of the algorithm is linear in the number of samples $n$, scaled by the number of pair copulas fitted up to the chosen truncation level, which for all practical purposes is $O(n)$.

A.2.2. CONFORMAL PREDICTOR
Motivation. Conformal prediction allows us to transform any underlying point predictor into a valid interval predictor. We do not discuss the generalized framework (transductive conformal prediction), which requires model training to be redone for every data point and hence carries a large computational burden for modern datasets with many data points; for more, see (Shafer & Vovk, 2008; Balasubramanian et al., 2014). We instead discuss only inductive conformal prediction, which is used in Data-SUITE. The inductive method separates the two processes needed: (1) training the underlying model and (2) computing the conformal estimates.

Conformal Prediction Details. Practically, we split the training set ($|D^{+}_{train}| = n$) into two disjoint sets, the proper training set and the calibration set: $D^{+}_{train} = D^{+}_{train2} \cup D^{+}_{cal}$, where $|D^{+}_{train2}| = m$ and $|D^{+}_{cal}| = n - m$. We use the proper training set to create the prediction rules for the feature-wise regressor ($g$). The calibration set is used for "conformalization", i.e., for computing the non-conformity scores and p-values. The non-conformity score $\mu_i$ of each example is a function which computes the disagreement (i.e., non-conformance) between the prediction and the true value. Note that we only compute non-conformity scores on the calibration set. To obtain our intervals, we need to determine the critical value $\epsilon$ from the non-conformity scores: we first sort them in descending order, and the critical value $\epsilon$ is then the $\big((|D^{+}_{cal}| + 1)(1-\alpha)\big)$-th smallest residual (Vovk, 2013). Consequently, for any confidence level $(1-\alpha)$, we can use the critical value to find $p(x) > \alpha$, which corresponds to the maximum and minimum values with p-values larger than $\alpha$; as a consequence, we obtain the upper and lower bounds of our predictive intervals. The whole process is detailed in Algorithm 1.

Algorithm 1: General Inductive Conformal Prediction, CE($\alpha$, $D^{+}_{train2}$, $D^{+}_{cal}$, $\mu$)
Input: significance $\alpha$, non-conformity measure $\mu$, proper training set $D^{+}_{train2}$, calibration set $D^{+}_{cal}$. Output: interval estimate.
1. Train the underlying model $g$ on $D^{+}_{train2}$.
2. Compute non-conformity scores on the calibration set: $P = \{\}$; for each $(x, x_i) \in D^{+}_{cal}$, compute $\mu_i(x) = |x_i - (g_i \circ f)(x)|$ and add $\mu_i(x)$ to $P$.
3. Conformalization: sort $P$ in descending order to obtain scores $S$; the critical value $\epsilon$ is the $\big((|D^{+}_{cal}| + 1)(1-\alpha)\big)$-th smallest residual in $S$.
4. Interval predictor $\Gamma_\alpha(x)$: for each $x \in D_{test}$, compute $x_m = (g_i \circ f)(x)$ and return $[l_i(x), r_i(x)] = [x_m - \epsilon, x_m + \epsilon]$.

Normalization to obtain adaptive instance-wise intervals can be done using a normalization function $\sigma_i$, as described in the main paper.
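A minimal sketch of Algorithm 1 for a single feature-wise regressor is given below; the choice of a scikit-learn decision tree and the array names are illustrative, not the exact benchmark code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def inductive_conformal_intervals(Z_train, x_train, Z_cal, x_cal, Z_test, alpha=0.05):
    """Inductive conformal CI for one feature x_i, predicted from a representation Z = f(X).
    Uses absolute-error non-conformity scores; returns [lower, upper] per test instance."""
    # 1. Train the underlying feature-wise regressor g_i on the proper training set.
    g_i = DecisionTreeRegressor(min_samples_leaf=5).fit(Z_train, x_train)
    # 2. Non-conformity scores on the calibration set: mu_i(x) = |x_i - g_i(f(x))|.
    scores = np.abs(x_cal - g_i.predict(Z_cal))
    # 3. Conformalization: critical value = ((n_cal + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
    eps = np.sort(scores)[min(k, len(scores)) - 1]
    # 4. Interval predictor: [x_m - eps, x_m + eps] around each point prediction.
    x_m = g_i.predict(Z_test)
    return np.column_stack([x_m - eps, x_m + eps])

# Usage with synthetic data standing in for f(X) and one feature x_i.
rng = np.random.default_rng(0)
Z = rng.normal(size=(600, 3))
x = Z @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=600)
intervals = inductive_conformal_intervals(Z[:300], x[:300], Z[300:500], x[300:500], Z[500:])
mean_width = (intervals[:, 1] - intervals[:, 0]).mean()
```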
Remarks on theoretical guarantees. A motivation for conformal prediction in high-stakes settings is the theoretical guarantee on CI coverage validity (see Property A.2): at a confidence level $(1-\alpha)$ of 95% ($\alpha = 0.05$), the true value lies within the CIs in at least 95% of cases. The framework is non-parametric and only makes the exchangeability assumption (detailed next); the guarantee of coverage validity holds for any choice of dataset, underlying model, or non-conformity measure, which makes it a versatile option.

Property A.2 (Validity). Under the exchangeability assumption, the conformal predictor returns an interval such that $P\big(Y \in [l_i(x), r_i(x)]\big) \geq 1 - \alpha$, i.e., the error rate is at most $\alpha$.

The validity of the CI holds if the data are independent and identically distributed (IID). In practice, the weaker assumption of exchangeability (see Assumption A.1) also guarantees validity (Lei et al., 2018). This means we are not required to impose any additional requirements for the validity of the CI, since the aforementioned assumptions on the underlying data are typically made for any ML model. We also highlight that validity is maintained even in the case of normalization, as long as the proper training set and calibration set are disjoint.

Assumption A.1 (Exchangeability). In a dataset of $n$ observations, the data points do not follow any particular order, i.e., all $n!$ permutations are equiprobable.

Exchangeability is weaker than the IID assumption; IID observations satisfy exchangeability.

A.2.3. DATA-SUITE BEYOND TABULAR
Currently, Data-SUITE has focused primarily on tabular data, the most ubiquitous data type across industries (Borisov et al., 2021). However, there is of course value in applying it to other high-dimensional modalities such as images or text. We provide some possible ideas of how Data-SUITE could be adapted for these modalities.
Generator. As discussed in the main text, a copula might not be ideal for very high-dimensional (large $d_X$) data in domains such as computer vision or genomics. In those cases, copula modeling can be replaced by domain-specific augmentation techniques, for example Generative Adversarial Networks (GANs).
Representer. PCA, while the workhorse of tabular data, is typically not suitable for modalities such as images. Methods such as autoencoders could easily replace PCA in this instance.
Conformal Predictor. The primary challenge lies in the conformal predictor. Data-SUITE builds feature-wise regressors, which fits the tabular feature-wise setting. Two alternatives are proposed for other modalities: (1) construct instance-wise regressors instead of feature-wise regressors, or (2) treat the components of an intermediate representation vector as the "features".

B. Benchmarks & Experimental Details

B.1. Benchmarks & Implementations

B.1.1. DATA-SUITE
Implementation details. Data-SUITE adopts a pipeline-based approach to constructing CIs, leveraging copula modeling, representation learning, and conformal estimation. We break down each of these below.
Copula modeling: We use the copula to estimate the multivariate joint distribution from the univariate marginal distributions. We make use of vine copulas (Bedford & Cooke, 2001) for this task. Specifically, we use direct vines (D-vines), which impose constraints on the edges such that we only learn vines of the structure given in Figure 9 below. When fitting each tree, we select the best base copula (Gaussian, Frank, Clayton, or Gumbel) based on likelihood.

[Figure 9. D-vine structure.]

Finally, when we sample from the copula, we specifically sample $n_{samples} = |D_{train}|$.
Representation learning: For the representer, we specifically make use of Principal Component Analysis (PCA). That said, this block can simply be replaced with any alternative, such as an autoencoder. We pre-process all data prior to the representer by standardizing it, such that each feature has zero mean and unit variance (see Eq. 9); if features are categorical, we one-hot encode them. When applying PCA to learn the latent representation, we halve the dimensionality of $D_{train}$, i.e., $d_X/2$.
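As an illustration only (assuming scikit-learn; the exact preprocessing code is not given in the paper), the standardization and PCA-to-half-dimensionality step could look like:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def fit_representer(X_train):
    """Standardize to zero mean / unit variance, then project to d_X / 2 components."""
    d_x = X_train.shape[1]
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=max(1, d_x // 2)).fit(scaler.transform(X_train))
    return scaler, pca

def apply_representer(scaler, pca, X):
    """Compute the low-dimensional representation f(X) for new instances."""
    return pca.transform(scaler.transform(X))

# Usage
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 8))
scaler, pca = fit_representer(X_train)
Z_test = apply_representer(scaler, pca, rng.normal(size=(200, 8)))
```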
Conformal prediction: The two most important design decisions for conformal estimation are selecting the underlying feature-wise regressor ($g_i$) and the non-conformity score ($\mu_i$). We selected $g_i$ as in conventional machine learning, by grid search over different candidate models evaluated on a validation set; we evaluated an MLP, KNN, decision trees, and random forests. We ultimately selected a decision tree to serve as the base model for all $d_X$ feature-wise regressors, with parameters max_depth=None, min_samples_split=2, min_samples_leaf=5. Interestingly, a simpler base model proved more effective and outperformed more heavily parameterized models that would require more compute; this is an additional advantage of Data-SUITE. We hypothesize that the simpler model suffices because we build feature-wise regressors, hence the mapping function is easier to approximate. Finally, we use the absolute-error non-conformity score. In practice, any non-conformity score can be used; however, for our application, where we want to approximate the true value as closely as possible, the absolute error is intuitively the most sensible. Our non-conformity score is given by Eq. 10 below.

$\mu_i(x) = |x_i - (g_i \circ f)(x)|$   (10)

B.1.2. BAYESIAN NEURAL NETWORK (BNN)
Bayesian modelling, when applied to neural networks, involves a likelihood function $p(Y \mid X, \theta)$, where the parameters $\theta$ are estimated by a neural network. In contrast to conventional neural networks, which have point estimates for the parameters, BNNs aim to learn distributions over the parameters. However, modern neural networks have many parameters and weights, which makes exact inference largely intractable. Hence, approximate Bayesian methods such as variational inference are often used instead. By this we mean that we do not compute the posterior in the conventional manner; rather, the variational approximation replaces the posterior $p(\theta \mid D)$ with a more tractable variational distribution $q(\theta; \lambda)$. The Kullback-Leibler (KL) divergence between the true distribution and the variational distribution is then minimized as the loss function, which equates to maximizing the evidence lower bound (ELBO); see Eq. 11. We note, however, that the reparameterization trick (Kingma et al., 2015) is needed to make backpropagation possible.

$\mathcal{L}_{VI}(D; \lambda) := D_{KL}\big(q(\theta; \lambda) \,\|\, p(\theta)\big) - \mathbb{E}_{q(\theta;\lambda)}\big[\log p(D \mid \theta)\big]$   (11)

Implementation details. We train a 5-layer MLP model. A Gaussian prior is placed over the weights and we optimize the KL divergence during training. Our implementation is based on (Ghosh et al., 2018) and we use the UQ360 implementation (https://github.com/IBM/UQ360).

B.1.3. DEEP ENSEMBLES (ENSEMBLE)
Deep Ensembles (Lakshminarayanan et al., 2017) is widely regarded as the state-of-the-art non-Bayesian uncertainty estimation method. The rationale is that training multiple randomly initialized models yields more robust predictions; uncertainty can be computed as the variance of the different models' predictions.
We highlight some important features of Deep Ensembles: (1) optimization is based on a proper scoring rule, such that the loss has a unique minimum, which encourages the model to approximate the true probability distribution; however, the proper nature of the scoring rule introduces a distributional assumption. (2) Deep Ensembles uses adversarial perturbations based on the Fast Gradient Sign Method (Goodfellow et al., 2014b), given by Eq. 12:

$x' := x + \eta \odot \mathrm{sgn}\big(\nabla_x \mathcal{L}(x, y; \theta)\big)$   (12)

where $\odot$ denotes element-wise multiplication. This modifies the loss function used for gradient descent:

$\mathcal{L}_{tot}(X, y; \theta) := \mathcal{L}(X, y; \theta) + \mathcal{L}(X', y; \theta)$   (13)

To construct prediction intervals with an uncertainty estimate, a conditionally normal distribution is assumed and the intervals are given by Eq. 14:

$\Gamma(x^{*}) := \big[\, \mathbb{E}[y \mid x^{*}] - \sqrt{\mathrm{var}[y \mid x^{*}]},\; \mathbb{E}[y \mid x^{*}] + \sqrt{\mathrm{var}[y \mid x^{*}]} \,\big]$   (14)

Implementation details. We use 5 models in the ensemble, all randomly initialized. Each model is a 3-layer MLP, which we train for 10 epochs; the learning rate was empirically determined on a validation set. To compute uncertainty estimates at test time, we obtain predictions from each model in the ensemble. The prediction interval is computed as per Eq. 14.

B.1.4. GAUSSIAN PROCESS (GP)
Gaussian process (GP) models (Williams & Rasmussen, 2006) are fully characterized by their mean and covariance functions $\mu$ and $\Sigma$. The inference step can be performed exactly, as marginalizing multivariate normal distributions can be written in closed form; see Eq. 15, whereby conditioning on the dataset $D$ we can compute the posterior:

$(y^{*} \mid x^{*}, D) \sim \mathcal{N}\big(\mu^{*} + (\Sigma^{*})^{T} \Sigma^{-1}(y - \mu),\; \Sigma^{**} - (\Sigma^{*})^{T} \Sigma^{-1} \Sigma^{*}\big)$   (15)

We note the two assumptions made by GPs: (1) the data are conditionally normally distributed and (2) the covariance (kernel) selection is correct; violations of these assumptions can severely impact performance. In addition, GPs suffer from the $O(n^3)$ computational complexity of matrix inversion, so for high-dimensional data stochastic approximations are often used instead.
Implementation details. The GP is fit with a radial basis function (RBF) kernel and we make use of the scikit-learn (https://scikit-learn.org) implementation of GPs, which is based on (Williams & Rasmussen, 2006).

B.1.5. MONTE-CARLO DROPOUT (MCD)
Dropout layers are typically used as a regularizer at training time. As shown by (Gal & Ghahramani, 2016), dropout networks can also be used at test time (Monte-Carlo Dropout, MCD) and are a variational approximation to deep Gaussian processes; note that this induces an assumption of normality similar to GPs. A key feature of MCD is that dropout at test time effectively provides an ensemble of different models without having to retrain the model itself. For this, we perform multiple stochastic forward passes at test time. When making predictions, we compute a conditional mean, approximated by Monte-Carlo integration, i.e., the mean of multiple forward passes. The prediction interval characterizing the uncertainty is given by Eq. 16:

$\Gamma(x^{*}) := \big[\, \mathbb{E}[y \mid x^{*}] - \sqrt{\mathrm{var}[y \mid x^{*}]},\; \mathbb{E}[y \mid x^{*}] + \sqrt{\mathrm{var}[y \mid x^{*}]} \,\big]$   (16)

Implementation details. We train a 3-layer MLP with dropout ($p = 0.1$) for 10 epochs; the learning rate was empirically determined on a validation set. To compute uncertainty estimates at test time, we perform 20 Monte-Carlo samples (forward passes). The prediction interval is computed as per Eq. 16.
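A minimal sketch of the MCD interval construction (Eq. 16), keeping dropout active at test time, is shown below; the PyTorch architecture and names are illustrative rather than the exact benchmark code.

```python
import torch
import torch.nn as nn

class DropoutMLP(nn.Module):
    """3-layer MLP with dropout, kept stochastic at test time for MC sampling."""
    def __init__(self, d_in, d_hidden=64, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_interval(model, x, n_samples=20):
    """Multiple stochastic forward passes -> mean +/- sqrt(var) interval, as in Eq. 16."""
    model.train()  # keep dropout layers active at test time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)], dim=0)
    mean, std = preds.mean(dim=0), preds.std(dim=0)
    return mean - std, mean + std

# Usage (untrained model, for illustration only)
model = DropoutMLP(d_in=8)
x_test = torch.randn(16, 8)
lower, upper = mc_dropout_interval(model, x_test)
```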
B.1.6. QUANTILE REGRESSION (QR)
Quantile regressors estimate the conditional quantiles of a distribution (rather than the conditional mean), which gives a natural measure of the underlying distribution's spread. Typically, the MSE loss is replaced by the pinball loss, otherwise known as the quantile loss, given by Eq. 17, which balances the number of points above and below the quantiles (Koenker & Hallock, 2001). Neural networks can easily be applied by optimizing this loss with the appropriate number of output heads.

$\mathcal{L}_{pinball}(\mathcal{T}) := \sum_{(x,y) \in \mathcal{T}} \max\big((1-\alpha)(\hat{q}_{\alpha}(x) - y),\; \alpha(y - \hat{q}_{\alpha}(x))\big)$   (17)

Implementation details. The base model for predicting the quantiles is a gradient boosting regressor with 10 estimators, max_depth=5, min_samples_leaf=5, min_samples_split=10. We use the UQ360 implementation (https://github.com/IBM/UQ360).

B.2. Synthetic Experiment Details

B.2.1. DATASET & EXPERIMENT CONFIGURATIONS
The synthetic data $X = [X_1, X_2, X_3]$ is drawn IID from a multivariate Gaussian distribution, parameterized by mean vector $\mu$ and positive definite covariance matrix $\Sigma$:

$\mu = \begin{pmatrix} 5.0 \\ 0.0 \\ 10.0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 3.40 & 2.75 & 2.00 \\ 2.75 & 5.50 & 1.50 \\ 2.00 & 1.50 & 1.25 \end{pmatrix}$

We sample $n = 1000$ instances for both $D_{train}$ and $D_{test}$. We encode inconsistency and uncertainty into the features of the test set $D_{test}$ using a multivariate additive model $\hat{X} = X + Z$, where $Z \in \mathbb{R}^{n \times m}$ is the perturbation matrix. We conduct three experiments with different configurations ($D_a$, $D_b$, $D_c$); see Table 5.

Table 5. Different configurations of the synthetic data
  Config   Noise distribution                 Perturbation proportions    Perturbation variance
  Da       {Normal}                           {0.1, 0.25, 0.5, 0.75}      {2}
  Db       {Normal}                           {0.5}                       {1, 2, 3}
  Dc       {Beta, Gamma, Normal, Weibull}     {0.5}                       {2}

B.2.2. DOWNSTREAM MODEL
In Section 4.2, we have a downstream task wherein we compute the MSE for different models. The base model, which we train using $D_{train}$, is a linear regression model.

B.3. Real-Data Experiment Details

B.3.1. DATASETS
SEER Dataset. The SEER dataset consists of 240,486 patients enrolled in the American SEER program (Duggan et al., 2016). The dataset consists of features used to characterize prostate cancer, including age, PSA (severity score), Gleason score, clinical stage, and treatments. A summary of the covariate features can be found in Table 6. The classification task is to predict patient mortality, a binary label {0, 1}. The dataset is highly imbalanced, with 94% of patients surviving; hence, we extract a balanced subset of 20,000 patients (i.e., 10,000 with label 0 and 10,000 with label 1).

Table 6. Summary of features for the SEER dataset (Duggan et al., 2016)
  Feature             Range
  Age                 37-95
  PSA                 0-98
  Comorbidities       0, 1, 2, 3
  Treatment           Hormone Therapy (PHT), Radical Therapy - RDx (RT-RDx), Radical Therapy - Sx (RT-Sx), CM
  Grade               1, 2, 3, 4, 5
  Stage               1, 2, 3, 4
  Primary Gleason     1, 2, 3, 4, 5
  Secondary Gleason   1, 2, 3, 4, 5

CUTRACT Dataset. The CUTRACT dataset is a private dataset consisting of 10,086 patients enrolled in the British Prostate Cancer UK program (Prostate Cancer UK). Similar to the SEER dataset, it consists of the same features to characterize prostate cancer and has the same task of predicting mortality. A summary of the covariate features can be found in Table 7. Once again, the dataset is highly imbalanced; hence we extract a balanced subset of 2,000 patients (i.e., 1,000 with label 0 and 1,000 with label 1).
Table 7. Summary of features for the CUTRACT dataset (Prostate Cancer UK)
  Feature             Range
  Age                 44-95
  PSA                 1-100
  Comorbidities       0, 1, 2, 3
  Treatment           Hormone Therapy (PHT), Radical Therapy - RDx (RT-RDx), Radical Therapy - Sx (RT-Sx), CM
  Grade               1, 2, 3, 4, 5
  Stage               1, 2, 3, 4
  Primary Gleason     1, 2, 3, 4, 5
  Secondary Gleason   1, 2, 3, 4, 5

ADULT Dataset. The ADULT dataset (Asuncion & Newman, 2007) has 32,561 instances with a total of 13 attributes capturing demographic (age, gender, race), personal (marital status), and financial (income) features, amongst others. The classification task is to predict whether a person earns over $50K or not. We encode the features (e.g., race, sex, gender) and a summary can be found in Table 8. There is a known bias between gender and income in the dataset, which we use in particular to highlight the data exploration use-case. We perform a train-test split such that $D_{train}$ and $D_{test}$ have approximately equal sizes, with 15,378 and 14,784 samples respectively. Note that there is an imbalance across certain features; however, these are among the sensitive attributes, so we do not balance the dataset on them, since we wish to show in both the data exploration and model deployment experiments that we can identify these biases in the data, and balancing might eliminate them.

Table 8. Summary of features for the ADULT dataset (Asuncion & Newman, 2007)
  Feature            Range
  Age                17-90
  education-num      1-16
  marital-status     0, 1
  relationship       0, 1, 2, 3, 4
  race               0, 1, 2, 3, 4
  sex                0, 1
  capital-gain       0, 1
  capital-loss       0, 1
  hours-per-week     1-99
  country            0, 1
  employment-type    0, 1, 2, 3
  salary             0, 1

ELECTRICITY Dataset. The Electricity dataset (Harries & Wales, 1999) represents energy pricing in Australia over the period May 1996 to December 1998, with recordings every 30 minutes, giving 45,312 samples. The dataset records the energy prices and demand for New South Wales and Victoria, and the amount of power transferred between the two states. The goal is to predict whether the transfer price increases or decreases. The covariates outlined in Table 9 are normalized to the interval [0, 1]. We temporally partition the dataset such that $D_{train}$ covers mid-1996 to early 1997 and $D_{test}$ covers early 1997 to 1998. The dataset has been characterized as having concept shift for some features over the test period (although without an explicit timepoint or label), possibly due to behavioral changes or consumption-pattern changes (Zliobaite, 2013). Hence, $D_{test}$ consists of data that is both congruous and incongruous with $D_{train}$.

Table 9. Summary of features for the ELECTRICITY dataset (Harries & Wales, 1999)
  Feature      Range
  data         0-1
  period       0-1
  nswprice     0-1
  nswdemand    0-1
  vicprice     0-1
  vicdemand    0-1
  transfer     0-1
  class        0-1

B.3.2. DOWNSTREAM MODEL
In Section 4.3, we have a downstream task for each of the three datasets. For all datasets, the base model, which we train using $D_{train}$, is a random forest classifier with 100 estimators, with splits based on the Gini criterion.

C. Additional Experiments
This appendix presents additional experiments: validating further properties of Data-SUITE, conducting further comparisons, and deep-diving into the regions identified by Data-SUITE and the insights that can be garnered.

C.1.
Data-SUITE Ablation
Data-SUITE adopts a pipeline-based approach. Hence, to better understand the effect of each component, we perform an ablation study of its constituent components, which are compared to the complete pipeline, denoted Data-SUITE (ALL). The two components that we explicitly test are:
Data-SUITE (CONF): the conformal predictor without the representer. The conformal prediction process and instance stratification are as defined in the main paper.
Data-SUITE (COP): the copula by itself. For this ablation, we fit the copula on $D_{train}$. Then, to compute intervals per feature, we condition on the remaining features and sample 100 estimates from the copula; for example, for feature $x_1$, we condition on $x_2, \dots, x_n$. The uncertainty estimates are then the variance of these samples, which is used to stratify the samples as before.

Firstly, we compare the coverage of the constituent components, as per Table 10, for different configurations of the synthetic experiment. We see that Data-SUITE (ALL) is the only method to maintain coverage guarantees across all configurations. That said, we see a significant divergence between Data-SUITE (CONF) and Data-SUITE (COP), which suggests the conformal estimator is the most important component; the remaining gap to Data-SUITE (ALL) is then likely attributable to the representer.

Table 10. Coverage of constituent components of Data-SUITE
  Perturbation          Proportion (Da): .1   .25   .5    Variance (Db): 1    2
  Data-SUITE (ALL)      .97                   .96   .96   .95                 .96
  Data-SUITE (CONF)     .89                   .88   .90   .87                 .90
  Data-SUITE (COP)      .16                   .15   .16   .17                 .16

Secondly, we compare the downstream MSE of the constituent components, as per Table 11, for different configurations of the synthetic experiment. Recall that a lower MSE (i.e., closer to Train Data (BASELINE)) is desired. Data-SUITE (ALL) outperforms the constituent components and is less sensitive to perturbations. Interestingly, despite the higher coverage of Data-SUITE (CONF) vs. Data-SUITE (COP), when evaluated on a downstream task, Data-SUITE (COP) in fact produces results that are less sensitive to perturbations.

Table 11. Downstream MSE for ablations of constituent components of Data-SUITE
  Perturbation            Proportion (Da): .1   .25    .5     Variance (Db): 1    2
  Train Data (BASELINE)   .067                  .059   .068   .065                .068
  Test Data               .222                  .513   .889   .275                .889
  Data-SUITE (ALL)        .069                  .122   .197   .104                .197
  Data-SUITE (CONF)       .125                  .396   .846   .293                .846
  Data-SUITE (COP)        .220                  .277   .451   .236                .451

Takeaway: Both results show reduced performance for the constituent components of Data-SUITE, with each component having different individual sensitivities. This highlights the necessity of the interconnected Data-SUITE (ALL) framework.

C.2. Regions identified by Data-SUITE are NOT OOD
In this experiment, we address the question of whether the regions identified by Data-SUITE as uncertain or inconsistent are in fact OOD. We benchmark four widely used methods (with different detection mechanisms) that have been applied in the literature for OOD and outlier detection:
- Mahalanobis distance (Lee et al., 2018)
- SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection (Zhao et al., 2021)
- COPOD: Copula-Based Outlier Detection (Li et al., 2020)
- Isolation Forest (Liu et al., 2012)
For SUOD, much like the original paper, we make use of an ensemble of base estimators, namely Local Outlier Factor (LOF) (Breunig et al., 2000), COPOD, and Isolation Forest.
For each dataset, we have the instance IDs identified by Data-SUITE for various proportions. We then apply each of the aforementioned methods and compute the overlap between the predicted OOD/outlier instances and our identified uncertain and inconsistent instances (a sketch of this overlap computation is shown below).
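As an illustration, assuming scikit-learn's IsolationForest as one of the detectors and hypothetical array names, the match-rate computation could look like:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def ood_overlap(X_train, X_test, suite_flagged_idx, contamination=0.1, seed=0):
    """Fraction of Data-SUITE-flagged test instances that an OOD/outlier
    detector (here Isolation Forest) also flags as outliers."""
    detector = IsolationForest(contamination=contamination, random_state=seed).fit(X_train)
    is_outlier = detector.predict(X_test) == -1  # -1 = outlier, 1 = inlier
    flagged = np.zeros(len(X_test), dtype=bool)
    flagged[suite_flagged_idx] = True
    return (is_outlier & flagged).sum() / max(flagged.sum(), 1)

# Usage with synthetic placeholders for the data and the flagged instance IDs.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
X_test = rng.normal(size=(500, 5))
uncertain_idx = rng.choice(500, size=100, replace=False)  # stand-in for Data-SUITE output
print(f"OOD overlap (match rate): {ood_overlap(X_train, X_test, uncertain_idx):.2%}")
```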
The results for the overlap of uncertain instances are shown in Fig. 10: we see minimal overlap across methods, ranging between 2-18%. We additionally evaluate the confidence scores (for the methods that provide them as outputs); the goal is to see, when an instance is predicted as OOD/outlier, with what confidence the detection method ascribes it. The results in Fig. 11 show that the methods were often unconfident, with average confidence scores ranging between 5-50%. This suggests the identified uncertain instances are unlikely to be OOD. A similar question can be asked for the inconsistent instances; the results, shown in Fig. 12, mirror those for the uncertain instances and suggest that the identified inconsistent instances are likewise unlikely to be OOD.

Takeaway: The Data-SUITE-identified uncertain and inconsistent instances are unlikely to be OOD. The reason is the limited overlap between predicted OOD instances and both the uncertain and inconsistent instances, coupled with unconfident (low-probability) OOD predictions.

[Figure 10. OOD-uncertain instances match rate (overlap); panels: SEER-CUTRACT, Electricity.]
[Figure 11. Mean probability of predicted OOD, i.e., confidence; panels: SEER-CUTRACT, Electricity.]
[Figure 12. OOD-inconsistent instances match rate (overlap); panels: SEER-CUTRACT, Electricity.]

C.3. EDA & digestible prototype insights
As discussed in the main paper, we now conduct a more detailed analysis of the regions identified by Data-SUITE, as well as the digestible average prototypes. This illustrates the full potential of the detail that practitioners can get from Data-SUITE and how they can practically use it to garner insights about the data. We present the diagram again for easy reference (see Fig. 14).

SEER-CUTRACT. Data-SUITE identifies distinct certain (green) and uncertain (red) regions of ID data, as shown in Fig. 14 (i). Beyond average prototypes for the $D_{test}$ regions, i.e., CUTRACT (UK), we also find the nearest-neighbor SEER (USA) prototypes for each instance in the identified regions. Comparing these prototypes helps us tease out the differences between the two geographic sites (a sketch of this nearest-prototype computation follows). We note three specific insights that can assist end-users: (1) certain instances represent less severe patients than the uncertain instances (see PSA values); (2) for certain instances, the CUTRACT (UK) and SEER (USA) prototypes are similar across all feature values; (3) for uncertain instances, the CUTRACT (UK) and SEER (USA) prototypes differ in certain feature values (PSA, more comorbidities, different treatment, staging score). We conduct an extended deep-dive below, highlighting features such as PSA that show no difference on a population level, or for certain instances; only when teasing out the uncertain instances and their prototypes do we see this difference. This example illustrates the full potential of how Data-SUITE can be used by practitioners to uncover differences across sites (to benefit clinical practice), whilst also identifying patients for whom model performance would be either reliable or substandard (independent of the model).
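One simple way to obtain such nearest-neighbor prototypes is sketched below; this is an illustration only, with placeholder names, and the paper's exact prototype procedure may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def region_prototypes(X_train, X_test_region):
    """For each test instance in an identified region (e.g., uncertain CUTRACT patients),
    find its nearest training (e.g., SEER) neighbor, then summarize both sides with an
    average prototype (feature-wise mean)."""
    nn = NearestNeighbors(n_neighbors=1).fit(X_train)
    _, idx = nn.kneighbors(X_test_region)
    nearest_train = X_train[idx[:, 0]]
    return X_test_region.mean(axis=0), nearest_train.mean(axis=0)

# Usage with placeholder arrays standing in for SEER (train) and a CUTRACT region (test).
rng = np.random.default_rng(0)
X_seer = rng.normal(size=(800, 6))
X_cutract_uncertain = rng.normal(loc=0.5, size=(120, 6))
test_proto, train_proto = region_prototypes(X_seer, X_cutract_uncertain)
```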
We highlight this in Figure 13, which shows no PSA difference between the USA and UK on a population level, or for certain instances; only when teasing out the uncertain instances and their prototypes does the difference appear.

[Figure 13. PSA deep-dive, highlighting how the difference is not evident at the population level or for certain instances; only the uncertain instances reveal the heterogeneity. The box-plot whiskers nevertheless indicate common support.]

We extend this analysis to further illustrate the detail practitioners can obtain from Data-SUITE by comparing the Earth Mover's Distance (EMD), a common metric to flag drift (see Fig. 15). We see no PSA difference between the USA and UK on a population level, or for certain instances; only when teasing out the uncertain instances and their prototypes do we see the difference (a sketch of this EMD comparison is given below).

Takeaway: While not evident at the population level, the heterogeneous groups can be teased out using Data-SUITE. This demonstrates the added capability Data-SUITE offers to practitioners.

[Figure 14. Insights from prototypes identified by Data-SUITE. Tables in the figure describe the average prototypes for certain and uncertain instances (PSA, comorbidities, treatment, grade and stage for SEER-CUTRACT; age, marital status, race and sex for Adult; nswprice, nswdemand, vicprice, vicdemand and transfer for Electricity). Panels: (i) SEER-CUTRACT: CUTRACT certain instances are similar to their nearest SEER prototypes, whilst CUTRACT uncertain instances differ from their nearest SEER prototypes (e.g., PSA). (ii) Adult: the certain and uncertain instances represent two different demographics, aligning with the known dataset biases toward females; the uncertain instances specifically highlight a sub-group of Black females. (iii) Electricity: the certain instances are similar to the training set in features and time, whilst the uncertain instances represent a later time period in which concept drift has likely occurred.]

[Figure 15. Earth Mover's Distance deep-dive into PSA: no difference between Train (SEER) and Test (CUTRACT) on the full dataset (population) or for certain instances (SEER nearest neighbors vs. CUTRACT), but a clear difference for uncertain instances. The difference only becomes evident once the uncertain instances are teased out.]
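A minimal sketch of this kind of EMD comparison, using scipy's 1-D Wasserstein distance on a single feature such as PSA (array names are placeholders, and the certain/uncertain index sets stand in for Data-SUITE's output), could be:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_report(psa_train, psa_test, certain_idx, uncertain_idx):
    """Compare EMD (1-D Wasserstein distance) on the full population vs. within the
    certain and uncertain regions identified by Data-SUITE."""
    return {
        "population": wasserstein_distance(psa_train, psa_test),
        "certain": wasserstein_distance(psa_train, psa_test[certain_idx]),
        "uncertain": wasserstein_distance(psa_train, psa_test[uncertain_idx]),
    }

# Usage with synthetic PSA-like values; a shift is injected only in the "uncertain" subgroup.
rng = np.random.default_rng(0)
psa_train = rng.gamma(shape=2.0, scale=10.0, size=2000)
psa_test = rng.gamma(shape=2.0, scale=10.0, size=1000)
uncertain_idx = np.arange(0, 300)
psa_test[uncertain_idx] += 30.0  # heterogeneous subgroup
certain_idx = np.arange(300, 1000)
print(emd_report(psa_train, psa_test, certain_idx, uncertain_idx))
```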
Adult. The dataset has a known bias between gender and income, which Data-SUITE successfully identifies. However, an added dimension between the most certain and uncertain instances, as shown in Fig. 7 (ii), is that marital status and race are also relevant biases. We capture this finding with prototypes: (i) certain prototype: younger, single, white males; (ii) uncertain prototype: older, married, Black females. These prototypes can inform end-users, such as data scientists and stakeholders, of the dangers of naively building models on this dataset without first considering these issues.

Electricity. The dataset has concept drift over time, which Data-SUITE identifies via the uncertain instances: they show increased transfer of energy between NSW and Victoria, increased NSW energy price, and decreased demand relative to the certain instances. These prototypes can inform end-users of this change in consumption habits, despite the data still lying in-distribution. Additionally, as per Fig. 7 (iii), the certain instances capture the time-frame closer to the training data (start 1997 to mid 1997), whilst the uncertain instances are later in time (mid 1997 to early 1998) and hence more likely affected by the concept drift.

C.4. Assessment of different CI regressors $g_i$
In our formulation, we select a CI regressor ($g_i$). We wish to assess whether the choice of CI regressor impacts performance on a fixed downstream predictive model. We hence carry out an experiment assessing different $g_i \in \{$RF, MLP, SVR, Tree$\}$, based on the Mean Performance Improvement (MPI, Eq. 8) on a downstream task.
Takeaway. We find that the variance of MPI across CI regressor types ranges between 0.04% and 3% on our three datasets. This low variability in MPI is desirable, as it indicates minimal impact of the choice of CI regressor on Data-SUITE's performance.

C.5. Rank for a diverse set of downstream models
We now present additional results for the experiment with a diverse set of downstream models, as outlined in Sec. 4.4; see Figure 16.

[Figure 16. Rank assessment (MPI rank, higher is better) of Data-SUITE against ENSEMBLE, GP, BNN, CONFORMAL, QR and MCD, across downstream model types (RF, MLP, GBT, ROBUST), for SEER-CUTRACT and Electricity.]

C.6. Inconsistent instances: λ sensitivity
As described in Section 3.3, an instance $x$ is inconsistent if the fraction of inconsistent features $\nu(x)$ is above a predetermined threshold $\lambda \in [0, 1]$: $\nu(x) > \lambda$. In our implementation, we use $\lambda = 0.5$, i.e., if more than 50% of the features are inconsistent, the sample is considered inconsistent (see the sketch after this subsection). For completeness, we conduct an analysis of the sensitivity to the value of $\lambda \in [0, 1]$. We show the accuracy score as a function of $\lambda$, as well as the number of instances that would be classified as inconsistent for that value of $\lambda$. The results are shown in Fig. 17. The flat, steady performance for small $\lambda$, followed by a drop-off at approximately $\lambda = 0.5$ across all datasets, suggests that our chosen value of $\lambda = 0.5$ is indeed sensible.

[Figure 17. λ sweep to flag inconsistent instances; panels: SEER-CUTRACT, Electricity.]

Takeaway: Based on the sweep across all three datasets, we see that $\lambda = 0.5$ is a sensible choice in practice.
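A minimal sketch of the $\nu(x)$ computation and λ-thresholding follows; the feature-wise CI arrays and names are placeholders for illustration.

```python
import numpy as np

def flag_inconsistent(X_test, lower, upper, lam=0.5):
    """X_test, lower, upper: (n_instances, n_features) arrays of feature values and
    feature-wise confidence-interval bounds. An instance is inconsistent if the
    fraction nu(x) of features falling outside their CI exceeds lambda."""
    outside = (X_test < lower) | (X_test > upper)
    nu = outside.mean(axis=1)  # fraction of inconsistent features per instance
    return nu > lam, nu

# Usage with toy arrays (CIs built before perturbing one instance).
rng = np.random.default_rng(0)
X_test = rng.normal(size=(5, 4))
lower, upper = X_test - 0.5, X_test + 0.5  # placeholder CIs
X_test[0] += 2.0                            # push one instance outside its intervals
inconsistent, nu = flag_inconsistent(X_test, lower, upper, lam=0.5)
```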
C.7. Computation time comparison
All experiments were run on CPU on a MacBook Pro with an Intel Core i5 and 16GB RAM. Besides the raw performance values, practitioners are often interested in the computation time associated with different methods, especially when the algorithms are applied to very large datasets with many instances. We hence compare the computation time needed by each method to train and to construct the predictive intervals for all instances in $D_{test}$. The results are presented in Table 12, with computation time recorded to the nearest minute.

Table 12. Comparison of computation time across methods (minutes)
  Method        SEER-CUTRACT   Adult   Electricity
  Data-SUITE    4              1       1
  BNN           2              1       1
  CONFORMAL     2              1       2
  ENSEMBLE      28             3       3
  GP            22             15      24
  MCD           19             13      19
  QR            1              1       1

Takeaway: Data-SUITE does not have the fastest computation time (due to its interconnected framework). However, its computation times are neither far from the alternatives nor prohibitive for practitioners.

C.8. Comparison to domain adaptation
In the domain adaptation paradigm, a popular metric is the maximum mean discrepancy (MMD) (Gretton et al., 2012). For example, (Kumagai & Iwata, 2019; Long et al., 2015; Yan et al., 2017; Haeusser et al., 2017) have used MMD as the metric to compare distributions and, subsequently, as the objective to minimize the distributional difference between representations. MMD is a distance-based metric that compares the mean embeddings of two probability distributions, source $S$ and target $T$, in a reproducing kernel Hilbert space $\mathcal{H}_k$, given by Eq. 18:

$\mathrm{MMD}(S, T) = \| \mu_k(S) - \mu_k(T) \|_{\mathcal{H}_k}$   (18)

We can compute unbiased estimates from samples of the two distributions after applying the kernel trick (here we use a radial basis function kernel); a sketch of such an estimator appears at the end of this subsection. In the domain adaptation literature, the goal is to minimize the latent feature divergence by optimizing the MMD. Hence, MMD is a fundamental component of domain adaptation and is used to identify instances with discrepancies between source and target domains. As a comparison to our approach (Data-SUITE), we cast the problem as a domain adaptation problem and apply MMD, with source distribution $S \equiv D_{train}$ and target distribution $T \equiv D_{test}$. We then use the computed MMD to stratify instances, where low MMD means certain and large MMD means uncertain. The mean performance improvement (MPI) is then computed for instances stratified by MMD, similar to the experiment in Section 4.3. Table 13 contrasts the results using MMD vs. Data-SUITE to stratify instances.

Takeaway. Data-SUITE achieves greater average performance improvement (MPI) across all three datasets than framing the problem as domain adaptation (using the MMD metric to identify and stratify instances).

Table 13. Comparison of average performance improvement for Data-SUITE vs. MMD (domain adaptation)
  Dataset         Data-SUITE   MMD
  SEER-CUTRACT    0.11         0.088
  Adult           0.64         -0.06
  Electricity     0.26         -0.17
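A minimal sketch of an RBF-kernel MMD estimate between samples from the source (train) and target (test) sets, using the simple plug-in (biased) estimator and hypothetical array names, is:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows in A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd_rbf(S, T, gamma=1.0):
    """Plug-in empirical MMD^2 between samples S ~ source and T ~ target with an RBF kernel."""
    k_ss = rbf_kernel(S, S, gamma).mean()
    k_tt = rbf_kernel(T, T, gamma).mean()
    k_st = rbf_kernel(S, T, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

# Usage: D_train as source, D_test as target; larger values indicate greater discrepancy.
rng = np.random.default_rng(0)
d_train = rng.normal(size=(300, 5))
d_test = rng.normal(loc=0.3, size=(300, 5))
print(f"MMD^2 estimate: {mmd_rbf(d_train, d_test):.4f}")
```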
C.9. Comparison to prototypes
We compare the samples identified by Data-SUITE with prototypes; specifically, we compare against ProtoDash (Gurumoorthy et al., 2019).
Takeaway. When explaining $D_{test}$ using prototypes, the prototypes uniformly cover the manifold, i.e., we cannot easily identify clusters of certain or uncertain instances as with Data-SUITE. Alternatively, when explaining $D_{test}$ with prototypes from $D_{train}$, the prototypes match the certain instances identified by Data-SUITE (DS), with Pearson $r \approx 0.8$ for DS (certain) vs. prototypes; on the contrary, they do not match the uncertain samples, where DS (uncertain) vs. prototypes gives $r \approx 0.55$. Thus, the prototypes are either uniform samples or certain samples, with no distinct uncertain samples. We conclude that Data-SUITE has the advantage of identifying BOTH certain and uncertain samples, which is relevant for reliable model deployment (desideratum D2).