Data Minimization at Inference Time

Cuong Tran, Department of Computer Science, University of Virginia, kxb7sd@virginia.edu
Ferdinando Fioretto, Department of Computer Science, University of Virginia, fioretto@virginia.edu

In domains with high stakes such as law, recruitment, and healthcare, learning models frequently rely on sensitive user data for inference, necessitating the complete set of features. This not only poses significant privacy risks for individuals but also demands substantial human effort from organizations to verify information accuracy. This paper asks whether it is necessary to use all input features for accurate predictions at inference time. The paper demonstrates that, in a personalized setting, individuals may only need to disclose a small subset of their features without compromising decision-making accuracy. The paper also provides an efficient sequential algorithm to determine the appropriate attributes for each individual to provide. Evaluations across various learning tasks show that individuals can potentially report as little as 10% of their information while maintaining the same accuracy level as a model that employs the full set of user information.

1 Introduction

The remarkable success of machine learning (ML) models has also brought with it pressing challenges at the interface of privacy and decision-making, especially when deployed in consequential domains such as legal processes, banking, hiring, and healthcare [23]. A particularly intriguing aspect is the conventional requirement for users to disclose their entire set of features during inference, thereby creating a gateway for potential data breaches, as exemplified in recent instances where millions of individuals' data were compromised []. Concurrently, this practice also places a burden on companies and organizations to ensure the accuracy and legal compliance of the disclosed information, as often observed in financial operations mandated by legislation like the Corporate Transparency Act []. Significantly, this conventional approach of disclosing the entire feature set during inference might also violate the data minimization principle, a cornerstone of several global privacy regulations including the European General Data Protection Regulation [2], the California Privacy Rights Act [1], and the Brazilian General Data Protection Law [3], among others. Through this lens, the discourse on enhancing privacy and reducing the verification onus in ML systems, particularly in crucial decision-making domains, gains a nuanced dimension, warranting a thorough examination.

This paper challenges this setting and asks whether it is necessary to require all input features for a model to produce accurate or nearly accurate predictions during inference. We refer to this question as the data minimization for inference problem. This unique question bears profound implications for privacy in model personalization, which often necessitates the disclosure of substantial user data. We show that, under a personalized setting, each individual may only need to release a small subset of their features to achieve the same prediction errors as those obtained when all features are disclosed. The overall framework is depicted in Figure 1. Following this result, we also provide an efficient sequential algorithm to identify the minimal set of attributes that each individual should reveal.
Evaluations across various learning tasks indicate that individuals may be able to report as little as 10% of their information while maintaining the same accuracy level as a model using the complete set of user information.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: Illustrative example of the proposed framework. The figure highlights the online and personalized nature of the minimization process. For each user, a key subset of features is identified to be used (core feature set). The minimized data sample is then processed by the pre-trained model, which generates a representative label.

In summary, the paper makes the following contributions: (1) it initiates a study to analyze the optimal subset of data features that each individual should disclose at inference time in order to achieve the same accuracy as if all features were disclosed; (2) it links this analysis to a new concept of data minimization for inference in relation to privacy; (3) it proposes theoretically motivated and efficient algorithms for determining the minimal set of attributes each individual should provide to minimize their data; and (4) it conducts a comprehensive evaluation illustrating the effectiveness of the proposed methods in preserving privacy at no or small costs in accuracy.

Related work. To the best of our knowledge, this is the first work studying the connection between data minimization and accuracy at inference time. This work was motivated by the core principle in data minimization which states that all collected data shall be adequate, relevant, and limited to what is necessary in relation to the purposes for which they are processed. This principle is adopted as a privacy cornerstone in different data protection regulations, including the General Data Protection Regulation [2] in Europe, the California Privacy Rights Act in the US [1], the General Personal Data Protection Law in Brazil [3], the Protection of Personal Information Act in South Africa [5], and the Personal Information Protection Act in South Korea [4]. While we are not aware of studies on data minimization for inference problems, we draw connections with differential privacy, feature selection, and active learning.

Differential Privacy (DP) [14] is a strong privacy notion that determines and bounds the risk of disclosing sensitive information of individuals participating in a computation. In the context of machine learning, DP ensures that algorithms can learn the relations between data and predictions while preventing them from memorizing sensitive information about any specific individual in the training data. In such a context, DP is primarily adopted to protect training data [6, 12, 32], and thus the setting contrasts with that studied in this work, which focuses on identifying the superfluous features revealed by users at test time to attain high accuracy. Furthermore, achieving tight constraints in differential privacy often comes at the cost of sacrificing accuracy, while the proposed privacy framework can reduce privacy loss without sacrificing accuracy under the assumption of linear classifiers.
Feature selection [10] is the process of identifying a relevant subset of features from a larger set for use in model construction, with the goal of improving performance by reducing the complexity and dimensionality of the data. The problem studied in this work can be considered a specialized form of feature selection with the added consideration of personalized levels, where each individual may use a different subset of features. This contrasts standard feature selection [20], which selects the same subset of features for each data sample. Additionally, and unlike traditional feature selection, which is performed during training and independent of the deployed classifier [10], the proposed framework performs feature selection at deployment time and is inherently dependent on the deployed classifier.

The framework proposed in this paper shares some similarities with active learning [15, 26, 33], whose goal is to iteratively select samples for experts to label in order to construct an accurate classifier with the least number of labeled samples. Similarly, the proposed framework iteratively asks individuals to reveal one attribute given their released features so far, with the goal of minimizing the uncertainty in model predictions. Finally, our study shares some connections with active feature acquisition [28, 13, 11, 21], which can be categorized as one branch of active learning. However, these prior works focus mostly on minimizing the total number of user features at training time, in sharp contrast with our work, which concentrates on inference time.

2 Settings and objectives

We consider a dataset $D$ consisting of samples $(x, y)$ drawn from an unknown distribution $\Pi$. Here, $x \in \mathcal{X}$ is a feature vector and $y \in \mathcal{Y} = [L]$ is a label with $L$ classes. The features in $x$ are categorized into public $x_P$ and sensitive $x_S$ features, with their respective indexes in vector $x$ denoted as $P$ and $S$, respectively. We consider classifiers $f_\theta : \mathcal{X} \to \mathcal{Y}$, which are trained on a public dataset from the same data distribution $\Pi$ above. The classifier produces a score $\hat f_\theta(x) \in \mathbb{R}^L$ over the classes and a final output class $f_\theta(x) \in [L]$, given input $x$. The model's outputs $f_\theta(x)$ and $\hat f_\theta(x)$ are also often referred to as hard and soft predictions, respectively. Without loss of generality, we assume that all features in $\mathcal{X}$ lie within the range $[-1, 1]$.

In this setting, we are given a trained model $f_\theta$ and, at inference time, we have access to the public features $x_P$. These features might be revealed through user queries or collected by the provider during previous interactions. Our focus is on the setting where $|S| \ll |D|$, and for simplicity, the following considers binary classifiers, where $\mathcal{Y} = \{0, 1\}$ and $\hat f_\theta \in \mathbb{R}$. Multi-class settings are addressed in Appendix C. In this paper, the term data leakage of a model refers to the percentage of sensitive features that are revealed unnecessarily, meaning that their exclusion would not significantly impact the model's output. Our goal is to design algorithms that accurately predict the output of the model using the smallest possible number of sensitive features, thus minimizing the data leakage at inference time. This objective reflects our desire for privacy.

To clarify key points discussed in the paper, let us consider a loan approval task where individual features are represented by the set {Job, Loc(ation), Inc(ome)}. In this example, the Job feature belongs to the public features $x_P$, whereas Loc and Inc belong to the sensitive features $x_S$.
We also consider a trained linear model $f_\theta(x) = \mathbb{1}\{1.0\,\text{Job} - 0.5\,\text{Loc} + 0.5\,\text{Inc} \geq 0\}$ and look at a scenario where user (A) has a public feature Job = 1.0, and user (B) has a public feature Job = 0.9. Both users' sensitive feature values are unknown. However, for user A, the outcome can be conclusively determined without revealing any additional information, since all features are bounded within $[-1, 1]$. In contrast, for user B, the outcome cannot be determined solely based on the public feature, but revealing the sensitive feature Loc = −1.0 is enough to confirm the classifier outcome. This example highlights two important observations that motivate our study: (1) not all sensitive attributes may be required for decision-making during inference, and (2) different individuals may need to disclose different amounts and types of sensitive information for decision-making.

3 Core feature sets

With these considerations in mind, this section introduces the notion of core feature set, the first contribution of the paper, which will be used to quantify data minimization. The paper presents the key findings and defers all proofs to Appendix A. Throughout the paper, the symbols $R$ and $U$ are used to represent the sets of indices for revealed and unrevealed features of the sensitive attributes $S$, respectively. Given a vector $x$ and an index set $I$, we use $x_I$ to denote the vector containing entries indexed by $I$ and $X_I$ to represent the corresponding random variable. Finally, we write $f_\theta(X_U, X_R = x_R)$ as a shorthand for $f_\theta(X_U, X_R = x_R, X_P = x_P)$ to denote the prediction made by the model when the features in $U$ are unrevealed.

Our objective is to develop algorithms that can identify the smallest subset of sensitive features to disclose, ensuring that the model's output is accurate (with high probability) irrespective of the values of the undisclosed features. We refer to this subset as the core feature set.

Definition 1 (Core feature set). Consider a subset $R$ of sensitive features $S$, and let $U = S \setminus R$ be the unrevealed features. The set $R$ is a core feature set if, for some $y^\star \in \mathcal{Y}$,
$$\Pr\big(f_\theta(X_U, X_R = x_R) = y^\star\big) \geq 1 - \delta, \qquad (1)$$
where $\delta \in [0, 1]$ is a failure probability. When $\delta = 0$ the core feature set is called pure. Additionally, the label $y^\star$ satisfying Equation (1) is called the representative label for the core feature set $R$.

The concept of the representative label $y^\star$ is crucial for the algorithms that will be discussed later. These algorithms use limited information to make predictions, and when predictions are made using a set of unrevealed features, the representative label $y^\star$ is used in place of the model's prediction. In identifying core feature sets to minimize data leakage, it's crucial to consider model uncertainty, which arises from the unknown values of the unrevealed features. The following result links core feature sets with model entropy, which measures this uncertainty and is used by this work to minimize data leakage.

Proposition 1. Let $R \subseteq S$ be a core feature set with failure probability $\delta < 0.5$. Then, there exists a monotonically decreasing function $\epsilon : \mathbb{R}^+ \to \mathbb{R}^+$ with $\epsilon(1) = 0$ such that:
$$H\big[f_\theta(X_U, X_R = x_R)\big] \leq \epsilon(1 - \delta),$$
where $H[Z] = -\sum_{z \in [L]} \Pr(Z = z) \log \Pr(Z = z)$ is the entropy of the random variable $Z$.

This property highlights the relationship between core feature sets and the entropy associated with a model using incomplete information. As the δ value decreases, the model's predictions become more certain. When δ equals zero (or when $R$ represents a pure core feature set), the model's predictions can be fully understood without observing $x_U$, resulting in an entropy of 0.
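To make the definitions above concrete, the following minimal sketch revisits the loan example of Section 2: it checks whether the features revealed so far already fix the prediction of the toy linear model by evaluating the score at the extreme values of the unrevealed features, which, by linearity and the $[-1, 1]$ bound, is enough to certify a pure core feature set (the same idea is formalized for general linear classifiers in Section 5.2). The weights come from the example; the function name is illustrative.

```python
import itertools

# Toy linear model from Section 2: score(x) = 1.0*Job - 0.5*Loc + 0.5*Inc,
# prediction = 1 iff score >= 0, with every feature bounded in [-1, 1].
WEIGHTS = {"Job": 1.0, "Loc": -0.5, "Inc": 0.5}

def is_pure_core(revealed):
    """Return (determined, label): whether the revealed features already fix the prediction.

    By linearity, if the sign of the score is constant at the extreme values (+/-1)
    of every unrevealed feature, it is constant over the whole box [-1, 1]^|U|.
    """
    unrevealed = [f for f in WEIGHTS if f not in revealed]
    fixed = sum(WEIGHTS[f] * v for f, v in revealed.items())
    labels = set()
    for corner in itertools.product([-1.0, 1.0], repeat=len(unrevealed)):
        score = fixed + sum(WEIGHTS[f] * v for f, v in zip(unrevealed, corner))
        labels.add(int(score >= 0))
    return (True, labels.pop()) if len(labels) == 1 else (False, None)

print(is_pure_core({"Job": 1.0}))                # user A: (True, 1) -- no sensitive feature needed
print(is_pure_core({"Job": 0.9}))                # user B: (False, None)
print(is_pure_core({"Job": 0.9, "Loc": -1.0}))   # user B after revealing Loc: (True, 1)
```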
It is worth noticing that enhancing prediction accuracy necessitates revealing additional information, as illustrated by the previous result and the renowned information-theoretic proposition below:

Proposition 2. Given two subsets $R$ and $R'$ of sensitive features $S$, with $R \subseteq R'$,
$$H\big[f_\theta(X_U, X_R = x_R)\big] \geq H\big[f_\theta(X_{U'}, X_{R'} = x_{R'})\big],$$
where $U = S \setminus R$ and $U' = S \setminus R'$.

Figure 2: Frequency associated with the size of the minimum pure core feature set.

Thus, the parameter δ plays a crucial role in balancing the trade-off between privacy loss and the uncertainty of the model's predictions. It determines the amount of sensitive information that must be disclosed to have enough confidence in model predictions. As δ increases, fewer sensitive features need to be revealed, resulting in reduced data leakage but also less certainty in model predictions, and vice versa. Note that disclosing more sensitive attributes does not necessarily improve the accuracy of predictions for a specific individual, as highlighted in previous research [30]. Although it's challenging to rigorously analyze the relationship between privacy loss and predictive accuracy across various contexts, our empirical studies across multiple datasets and learning algorithms do suggest that exposing more sensitive features generally enhances prediction accuracy. Thus, the parameter δ appears to govern the trade-off between sacrificing privacy and gaining predictive performance.

As highlighted in the previous example, the core feature set is not uniform across users. This is further exemplified in Figure 2, using the Credit dataset [9] with a logistic regression classifier. The figure reports the cumulative count of users against the minimum number of features they need to disclose to ensure confident predictions. It demonstrates that many individuals need to disclose no additional information to attain accurate predictions (corresponding to a pure core feature set of size 0), and most individuals can achieve accurate predictions by disclosing only 2 sensitive features. These insights, together with the previous observations linking core feature sets to entropy, motivate the proposed online algorithm, the second contribution of the paper.

4 MinDRel: An algorithm to minimize data release at inference time

The goal of the proposed algorithm, called Minimize Data Reveal (MinDRel), is to uphold privacy during inference by revealing sensitive features one at a time, based on the values of the features released so far. This section provides a high-level description of the algorithm and outlines its challenges. Next, Section 5 applies MinDRel to linear classifiers and discusses its performance on several datasets and benchmarks. Further, Section 6 extends MinDRel to non-linear classifiers and considers an evaluation over a range of standard datasets.

Overview of MinDRel. MinDRel operates fundamentally on two critical actions: (1) determining the next feature to reveal for each user, and (2) verifying whether the disclosed features make up a core feature set for that user. These two operations are discussed in Sections 4.1 and 4.2, respectively.
The algorithm determines which feature to disclose for a specific user by inspecting the posterior probabilities $\Pr(X_j \mid X_R = x_R, X_P = x_P)$ for each unrevealed feature $j \in U$, taking into account the disclosed sensitive features $x_R$ and public features $x_P$. Given the current set of disclosed features $x_R$ and unrevealed features $x_U$, MinDRel chooses the subsequent feature $j^\star \in U$ as follows:
$$j^\star = \operatorname*{argmax}_{j \in U} F(x_R, x_j; \theta) := \operatorname*{argmax}_{j \in U} -H\big[f_\theta(X_j = x_j, X_{U \setminus \{j\}}, X_R = x_R)\big], \qquad (2)$$
where $F$ is a scoring function that evaluates the amount of information that can be acquired about the model's predictions when feature $X_j$ is disclosed. As suggested in previous sections, it's desirable to reveal the feature that provides the most insight into the model prediction upon disclosure. MinDRel uses Shannon entropy for this purpose as it offers a natural method for quantifying information. Once feature $X_j$ is disclosed with a value of $x_j$, the algorithm updates the posterior probabilities for all remaining unrevealed features. The process concludes either when all sensitive features have been disclosed or when a core feature set has been identified. It should also be noted that, within this framework, there is no need to perform data imputation when some features are missing. Unrevealed features are treated as random variables and are integrated during the prediction process.

Both the computation of the scoring function $F$ and the verification of whether a set of disclosed features constitutes a core feature set present two significant challenges for the algorithm. The rest of the section delves into these difficulties.

4.1 Computing the scoring function F

Designing a scoring function $F$ that measures how confident a model's prediction is when a user discloses an additional feature $X_j$ brings up two key challenges. First, the value of $X_j$ is unknown until the decision to reveal it is made, which complicates the computation of the entropy function. Second, even if the value of $X_j$ were known, determining the entropy of model predictions in an efficient manner is another difficulty. We next discuss how to overcome these challenges.

Dealing with unknown values. To address the first challenge, we exploit the information encoded in the disclosed features to infer $X_j$'s value and compute the posterior probability $\Pr(X_j \mid X_R = x_R)$ of the unrevealed feature $X_j$ given the values of the revealed ones. The scoring function, abbreviated as $F(X_j)$, can thus be modeled as the expected negative entropy under the randomness of $X_j$:
$$F(X_j) = \mathbb{E}_{X_j}\Big[-H\big[f_\theta(X_j, X_{U\setminus\{j\}}, X_R = x_R)\big]\Big] = -\sum_{z} \underbrace{H\big[f_\theta(X_j = z, X_{U\setminus\{j\}}, X_R = x_R)\big]}_{(A)}\,\underbrace{\Pr(X_j = z \mid X_R = x_R)}_{(B)}, \qquad (3)$$
where $z \in \mathcal{X}_j$ is a value in the support of $X_j$.

Efficient entropy computation. The second difficulty relates to how to estimate this scoring function efficiently. Indeed, this is challenged by two key components. The first (A) is the entropy of the model's prediction given a specific unrevealed feature value, $X_j = z$. This prediction is a function of the random variable $X_{U\setminus\{j\}}$, and, due to Proposition 1, its estimation is linked to the conditional densities $\Pr(X_{U\setminus\{j\}} \mid X_R = x_R, X_j = z)$. The second (B) is the conditional probability $\Pr(X_j = z \mid X_R = x_R)$. Efficient computation of these conditional densities is discussed next. First, we discuss a result relying on the joint Gaussian assumption of the input features. This result will be useful in providing a computationally efficient method to estimate such conditional density functions.
In the following, $\Sigma_{IJ}$ represents the $|I| \times |J|$ sub-matrix of a matrix $\Sigma$ formed by selecting rows indexed by $I$ and columns indexed by $J$.

Proposition 3. The conditional distribution of any subset of unrevealed features $U' \subseteq U$, given the values of released features $X_R = x_R$, is given by:
$$\Pr(X_{U'} \mid X_R = x_R) = \mathcal{N}\Big(\mu_{U'} + \Sigma_{U'R}\Sigma_{RR}^{-1}(x_R - \mu_R),\; \Sigma_{U'U'} - \Sigma_{U'R}\Sigma_{RR}^{-1}\Sigma_{RU'}\Big),$$
where $\Sigma$ is the covariance matrix of the input features.

Note that Equation (3) considers $U' = \{j\}$, and thus component (B) can be computed efficiently by exploiting the result above. To complete Equation (3), we need to estimate the entropy $H[f_\theta(X_j = z, X_{U\setminus\{j\}}, X_R = x_R)]$ (component (A)) for a specific instance $z$ drawn from $\Pr(X_j \mid X_R = x_R)$. This poses a challenge due to the non-linearity of the hard model predictions $f_\theta$ adopted. To tackle this computational challenge, we first estimate component (A) using soft labels $\hat f_\theta$ and then apply a thresholding operator. More specifically, we first estimate $\Pr(\hat f_\theta(X_j = z, X_{U\setminus\{j\}}, X_R = x_R))$ and, based on this distribution, we subsequently estimate $f_\theta$ as $\mathbb{1}\{\hat f_\theta \geq 0\}$, where $\mathbb{1}$ is the indicator function. In the following sections, we will show how to assess this estimate for linear and non-linear classifiers. Finally, by approximating the distribution over soft model predictions through Monte Carlo sampling, the scoring function $F(X_j)$ can be computed as
$$F(X_j) \approx -\frac{1}{|Z|} \sum_{z' \in Z} H\big[f_\theta(X_j = z', X_{U\setminus\{j\}}, X_R = x_R)\big], \qquad (4)$$
where $Z$ is a set of random samples drawn from $\Pr(X_j \mid X_R = x_R)$, estimated through Proposition 3, and thus can be computed efficiently.

When the Gaussian assumption does not hold, one can resort to (slower) Bayesian approaches to estimate the uncertainty of the unrevealed features $X_U$ given the set of revealed features $X_R = x_R$. A common approach involves treating $X_U$ as the target variable and employing a neural network to establish the mapping $X_U = g_w(X_R)$. Utilizing Bayesian techniques [17, 18], the posterior of the network's parameters $p(w \mid D) \propto p(w)\,p(D \mid w)$ can be computed initially. Based on the posterior distribution of the model's parameters $w$, the posterior of the unrevealed features can be calculated as $\Pr(X_U = x_U \mid X_R = x_R) = \Pr_{w \sim p(w \mid D)}(g_w(X_R) = x_U)$. However, implementing such a Bayesian network significantly increases not only training time but also inference time. Since it is necessary to compute $\Pr(X_U \mid X_R = x_R)$ for all possible choices of $U \subseteq S$, the number of Bayesian neural network regressors scales exponentially with $|S|$.

Figure 3: Comparison between Bayesian NN and the Gaussian assumption in terms of accuracy and data leakage.

Importantly, in our evaluation, the data minimization method that operates under the Gaussian assumption maintains similar decision-making and produces comparable outcomes to the Bayesian approach, even in cases where the Gaussian assumption is not applicable in practical settings. Figure 3 illustrates this comparison, showcasing the performance of the proposed mechanism on a real dataset (Credit dataset with $|S| = 5$) concerning accuracy (higher is better) and data leakage (lower is better) across various failure probability δ values. Notice how similar the performance is between the mechanisms that leverage the Gaussian assumption and those that operate without it (Bayesian NN). Importantly, the assumption of a Gaussian distribution is not overly restrictive or uncommon. In fact, it is a cornerstone in many areas of machine learning, including Gaussian Processes [31], Bayesian optimization [27], and Gaussian Graphical models [24].
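As an illustration of Proposition 3, the following is a minimal NumPy sketch of the Gaussian conditioning step, which is the building block behind component (B) of Equation (3) and the sample set $Z$ of Equation (4). The function and variable names are illustrative, not the authors' code, and the mean and covariance are assumed to be estimated from the public training data.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, revealed_idx, x_revealed, query_idx):
    """Pr(X_query | X_revealed = x_revealed) under a joint Gaussian N(mu, Sigma).

    Implements the standard conditioning formula of Proposition 3:
      mean = mu_Q + Sigma_QR Sigma_RR^{-1} (x_R - mu_R)
      cov  = Sigma_QQ - Sigma_QR Sigma_RR^{-1} Sigma_RQ
    """
    R, Q = np.asarray(revealed_idx), np.asarray(query_idx)
    if R.size == 0:                      # nothing revealed yet: return the marginal distribution
        return mu[Q], Sigma[np.ix_(Q, Q)]
    S_RR = Sigma[np.ix_(R, R)]
    S_QR = Sigma[np.ix_(Q, R)]
    S_QQ = Sigma[np.ix_(Q, Q)]
    K = S_QR @ np.linalg.solve(S_RR, np.eye(R.size))   # Sigma_QR Sigma_RR^{-1}
    mean = mu[Q] + K @ (np.asarray(x_revealed) - mu[R])
    cov = S_QQ - K @ S_QR.T
    return mean, cov

# Example: mean/covariance estimated from (public) training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
mu, Sigma = X_train.mean(axis=0), np.cov(X_train, rowvar=False)
m, C = conditional_gaussian(mu, Sigma, revealed_idx=[0, 1], x_revealed=[0.2, -0.4], query_idx=[2, 3])
```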
4.2 Testing a core feature set

The proposed iterative algorithm terminates once it determines that a subset $R$ of the sensitive feature set $S$ constitutes a core feature set. This validation process falls into two scenarios:

1. When $\delta = 0$: To confirm that $R$ is a pure core feature set, it is sufficient to verify that $f_\theta(X_U, X_R = x_R)$ remains constant for all possible realizations of $X_U$. As we will show in Section 5, linear classifiers can perform this check in linear time without making any specific assumptions about the input distribution.

2. When $\delta > 0$: In this case, the property above is no longer valid. As per Definition 1, to confirm a core feature set, it is essential to estimate the distribution $\Pr(\hat f_\theta(X_U, X_R = x_R))$. In Section 5, we demonstrate how to analytically estimate this distribution for linear classifiers. Furthermore, in Section 6, we illustrate how to locally approximate this distribution for nonlinear classifiers and derive a simple yet effective estimator that can be readily implemented in practice.

5 MinDRel for linear classifiers

This section is devoted to estimating the distribution $\Pr(\hat f_\theta(X_j = z, X_{U\setminus\{j\}}, X_R = x_R))$, expressed more simply as $\Pr(\hat f_\theta(X_U, X_R = x_R))$, and provides an instantiation of MinDRel for linear classifiers. In particular, it shows that both the estimation of the conditional distributions required to compute the scoring function $F(X_j)$ and the termination condition used to test whether a set of revealed features is a core feature set can be computed efficiently. This is an important property for the developed algorithms, which are designed to be online and interactive.

5.1 Efficiently estimating $\Pr(\hat f_\theta(X_U, X_R = x_R))$

For a linear classifier $\hat f_\theta(x) = \theta^\top x$, notice that when the input features are jointly Gaussian, the model predictions $\hat f_\theta(x)$ are also Gaussian, as highlighted by the following result.

Proposition 4. The model soft prediction $\hat f_\theta(X_U, X_R = x_R) = \theta_U^\top X_U + \theta_R^\top x_R$ is a random variable following a Gaussian distribution $\mathcal{N}(m_f, \sigma_f^2)$, with
$$m_f = \theta_R^\top x_R + \theta_U^\top\big(\mu_U + \Sigma_{UR}\Sigma_{RR}^{-1}(x_R - \mu_R)\big), \qquad (5)$$
$$\sigma_f^2 = \theta_U^\top\big(\Sigma_{UU} - \Sigma_{UR}\Sigma_{RR}^{-1}\Sigma_{RU}\big)\theta_U, \qquad (6)$$
where $\theta_U$ is the sub-vector of parameters $\theta$ corresponding to the unrevealed features $U$.

The result above is used to assist in calculating the conditional distribution of the model hard predictions $f_\theta(x)$, following thresholding. This is a random variable that adheres to a Bernoulli distribution, as shown next, and will be used to compute the entropy of the model predictions, as well as to determine if a subset of features constitutes a core set.

Proposition 5. Let the soft model predictions $\hat f_\theta(X_U, X_R = x_R)$ be a random variable following a Gaussian distribution $\mathcal{N}(m_f, \sigma_f^2)$. Then, the model prediction following thresholding, $f_\theta(X_U, X_R = x_R)$, is a random variable following a Bernoulli distribution $\mathcal{B}(p)$ with $p = \Phi(m_f / \sigma_f)$, where $\Phi(\cdot)$ is the CDF of the standard Normal distribution, and $m_f$ and $\sigma_f$ are given in Equations (5) and (6), respectively.

5.2 Testing pure core feature sets

In this subsection, we outline the methods for determining if a set of revealed features $R$ is a pure core feature set, and, if so, identifying its representative label. As per Definition 1, $R$ is a pure core feature set if $f_\theta(X_U, X_R = x_R) = y^\star$ for all $X_U$. This implies that $\hat f_\theta(X_U, X_R = x_R) = \theta_U^\top X_U + \theta_R^\top x_R$ must have the same sign for all $X_U$ in the range $[-1, 1]^{|U|}$. Given the box constraint $X_U \in [-1, 1]^{|U|}$, the linearity of the model considered allows us to directly compute the maximum and minimum values of $\hat f_\theta(X_U, X_R = x_R)$, rather than enumerating all possible values.
Specifically, we have:
$$\max_{X_U \in [-1,1]^{|U|}} \theta_U^\top X_U + \theta_R^\top x_R = \|\theta_U\|_1 + \theta_R^\top x_R, \qquad \min_{X_U \in [-1,1]^{|U|}} \theta_U^\top X_U + \theta_R^\top x_R = -\|\theta_U\|_1 + \theta_R^\top x_R.$$
Thus, if both the maximum and the minimum values are negative (resp. non-negative), then $R$ is a pure core feature set with representative label $y^\star = 0$ (resp. $y^\star = 1$). If not, $R$ is not a pure core feature set. Importantly, determining whether a subset $R$ of sensitive features $S$ constitutes a pure core feature set can be accomplished in linear time with respect to the number of features.

Proposition 6. Assume $f_\theta$ is a linear classifier. Then, determining if a subset $R$ of sensitive features $S$ is a pure core feature set can be performed in $O(|P| + |S|)$ time.

5.3 MinDRel-linear algorithm and evaluation

A pseudo-code of MinDRel specialized for linear classifiers is reported in Algorithm 1. At inference time, the algorithm takes as input a sample $x$ (which only exposes the set of public features $x_P$) and uses the training data $D$ to estimate the mean and covariance matrix needed to compute the conditional distribution of the model predictions given the unrevealed features (line 1), as discussed above. After initializing the set of revealed features to the empty set (line 2), it iteratively releases one feature at a time until a core feature set (and its associated representative label) is determined, as discussed in detail in Section 5.2. The released feature $X_{j^\star}$ is the one, among the unrevealed features $U$, that maximizes the scoring function $F$ (line 12). Computing such a scoring function entails estimating the conditional distribution $\Pr(X_j \mid X_R = x_R)$ (line 8), constructing a sample set $Z$ from this distribution (line 9), and computing the distribution of the model predictions for each sampled value in order to evaluate the score by Monte Carlo averaging (lines 10-11). Finally, after each iteration, the algorithm updates the sets of revealed and unrevealed features (line 13).

Algorithm 1: MinDRel for linear classifiers
  input: A test sample $x$; training data $D$
  output: A core feature set $R$ and its representative label $y^\star$
  1   $(\mu, \Sigma) \leftarrow \big(\tfrac{1}{|D|}\sum_{(x,y)\in D} x,\; \tfrac{1}{|D|}\sum_{(x,y)\in D} (x-\mu)(x-\mu)^\top\big)$
  2   Initialize $R = \emptyset$
  3   while True do
  4       if $R$ is a core feature set with repr. label $y^\star$ then
  5           return $(R, y^\star)$
  7       foreach $j \in U$ do
  8           Compute $\Pr(X_j \mid X_R = x_R)$ (using Prop. 3)
  9           $Z \leftarrow \text{sample}(\Pr(X_j \mid X_R = x_R))$, $T$ times
  10          Compute $\Pr(f_\theta(X_j = z, X_{U\setminus\{j\}}, X_R = x_R))$ (using Props. 4 and 5)
  11          Compute $F(X_j)$ (using Eq. (4))
  12      $j^\star \leftarrow \operatorname{argmax}_j F(X_j)$
  13      $(R, U) \leftarrow (R \cup \{j^\star\},\; U \setminus \{j^\star\})$

Notice that MinDRel relies on estimating the mean vector and covariance matrix from the training data, which is considered public for the scope of this paper. If the training data is private, various techniques exist to release DP mean and variance estimates [22, 7] and can be readily adopted. Nonetheless, the protection of training data is beyond the focus of this work.

Figure 4 reports the cumulative count of users (y-axis) against the minimum number of features they need to disclose (x-axis) to ensure confident predictions, for various failure probabilities δ. The model adopted is a Logistic Regression classifier trained on the Bank dataset [9]. The data minimization achieved by MinDRel is clearly apparent. For each test sample, MinDRel identifies core feature sets that are significantly smaller than the total sensitive feature set size $|S| = 7$. Interestingly, when $\delta > 0$, it discovers core feature sets smaller than 2 for the majority of users. This implies that most users would only need to reveal a small portion of their sensitive data to achieve accurate model predictions with either absolute certainty or high confidence.
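The following is a minimal end-to-end sketch of the linear instantiation above: the closed forms of Propositions 4 and 5, the pure-core test of Section 5.2, and the greedy loop of Algorithm 1. It assumes a linear score $\theta^\top x$ with no intercept over features scaled to $[-1, 1]$, and reuses the `conditional_gaussian` helper from the earlier sketch (after Proposition 3); all names are illustrative and this is not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def bernoulli_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def hard_label_prob(theta, mu, Sigma, R, x_R, U):
    """Pr(f_theta = 1 | X_R = x_R) for the linear score theta^T x (Props. 4 and 5)."""
    base = theta[R] @ np.asarray(x_R) if R else 0.0
    if not U:                                  # everything revealed: prediction is deterministic
        return float(base >= 0)
    mu_U, cov_U = conditional_gaussian(mu, Sigma, R, x_R, U)   # helper from the Section 4.1 sketch
    m_f = base + theta[U] @ mu_U
    s_f = np.sqrt(max(theta[U] @ cov_U @ theta[U], 1e-12))
    return norm.cdf(m_f / s_f)

def is_pure_core_linear(theta, R, x_R, U):
    """Section 5.2: the sign of theta_U^T X_U + theta_R^T x_R is constant over [-1, 1]^|U|."""
    base = theta[R] @ np.asarray(x_R) if R else 0.0
    slack = np.abs(theta[U]).sum() if U else 0.0
    if base - slack >= 0:
        return True, 1
    if base + slack < 0:
        return True, 0
    return False, None

def mindrel_linear(theta, mu, Sigma, sensitive, x_full, delta=0.0, T=100, seed=0):
    """Greedy loop of Algorithm 1; x_full is consulted only when a feature is revealed."""
    rng = np.random.default_rng(seed)
    R, x_R, U = [], [], list(sensitive)        # public indices/values could seed R and x_R here
    while True:
        if delta == 0:
            done, y_star = is_pure_core_linear(theta, R, x_R, U)
        else:
            p1 = hard_label_prob(theta, mu, Sigma, R, x_R, U)
            done, y_star = max(p1, 1 - p1) >= 1 - delta, int(p1 >= 0.5)
        if done or not U:
            return R, y_star
        scores = {}
        for j in U:                            # expected negative entropy, Eq. (4)
            m_j, v_j = conditional_gaussian(mu, Sigma, R, x_R, [j])
            zs = rng.normal(m_j[0], np.sqrt(max(v_j[0, 0], 1e-12)), size=T)
            rest = [k for k in U if k != j]
            ents = [bernoulli_entropy(hard_label_prob(theta, mu, Sigma, R + [j], x_R + [z], rest))
                    for z in zs]
            scores[j] = -np.mean(ents)
        j_star = max(scores, key=scores.get)
        R, x_R, U = R + [j_star], x_R + [x_full[j_star]], [k for k in U if k != j_star]
```

In this sketch the δ > 0 termination test uses the exact Bernoulli parameter from Proposition 5, while the δ = 0 test uses the $\ell_1$-norm bound, mirroring the two scenarios of Section 4.2.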
6 MinDRel for non-linear classifiers

Figure 4: Cumulative count of users against their core feature set size for various δ.

Next, the paper focuses on computing the estimate $\Pr(\hat f_\theta(X_U, X_R = x_R))$ and determining core feature sets when $f_\theta$ is a nonlinear classifier. Then, the section presents results that illustrate the practical benefits of MinDRel for data minimization at inference time on neural networks.

6.1 Efficiently estimating $\Pr(\hat f_\theta(X_U, X_R = x_R))$

First notice that, even if the input features $x$ are jointly Gaussian, the outputs $\hat f_\theta(x)$ will no longer adhere to a Gaussian distribution after a non-linear transformation. This complicates estimating the distribution $\Pr(\hat f_\theta(X_U, X_R = x_R))$. To tackle this challenge, the paper takes a local approximation of the model predictions $\hat f_\theta$ using a Gaussian distribution, as demonstrated in the following result.

Theorem 1. The distribution of the random variable $\hat f_\theta = \hat f_\theta(X_U, X_R = x_R)$, where $X_U \sim \mathcal{N}(\mu_U^{pos}, \Sigma_U^{pos})$, can be approximated by a Normal distribution as
$$\hat f_\theta \sim \mathcal{N}\Big(\hat f_\theta(X_U = \mu_U^{pos}, X_R = x_R),\; g_U^\top \Sigma_U^{pos}\, g_U\Big), \qquad (7)$$
where $g_U = \nabla_{X_U} \hat f_\theta(X_U = \mu_U^{pos}, X_R = x_R)$ is the gradient of the model prediction at $X_U = \mu_U^{pos}$.

Therein, the mean vector $\mu_U^{pos}$ and covariance matrix $\Sigma_U^{pos}$ of $\Pr(X_U \mid X_R = x_R)$ are derived from Proposition 3. This result leverages a first-order Taylor approximation of the model $\hat f_\theta$ around its mean.

6.2 Testing pure core feature sets

Unlike the linear case, nonlinear models lack an exact and efficient method to determine whether a set is a core feature set. This is primarily due to the non-convex nature of the adopted models, which poses challenges in finding a global optimum. This section thus proposes an approximate testing routine and demonstrates its practical ability to significantly minimize data leakage during testing while maintaining high accuracy. To determine if a subset $R$ of the sensitive features $S$ is a pure core feature set, we consider a set $Q$ of input points $(X_U, x_R)$. The entries corresponding to the revealed features are set to the value $x_R$, while the entries corresponding to the unrevealed features are sampled from the distribution $\Pr(X_U \mid X_R = x_R)$. The test verifies whether the model predictions $f_\theta(x)$ remain constant for all $x \in Q$. In the next section, we will show that, even considering arbitrary classifiers (e.g., we use standard neural networks), MinDRel can reduce data leakage dramatically when compared to standard approaches.

The MinDRel algorithm for nonlinear classifiers differs from Algorithm 1 primarily in the method used to compute the estimates for the distribution $\Pr(\hat f_\theta(X_j = z, X_{U\setminus\{j\}}, X_R = x_R))$ of the soft model predictions (line 11). In this case, this estimate is computed by leveraging the results from Theorem 1 and Proposition 5. Moreover, the termination test of the algorithm is based on the discussion in the previous section. Appendix B reports a description of the algorithm's pseudocode.

7 Experiments

Datasets and settings. This section evaluates MinDRel's effectiveness in limiting data exposure during inference. The experiments are conducted on six standard UCI datasets [9]. To further emphasize the benefits of MinDRel, we compare it to two baselines: the optimal model, which employs a brute-force method to find the smallest core feature set and its representative label and assumes all sensitive features are known, and the all-features model, which simply adopts the original classifier using all the data features for each test sample.
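Before turning to the empirical results, the following sketch illustrates the local Gaussian approximation of Theorem 1 (a first-order Taylor expansion around the posterior mean), assuming a PyTorch model whose scalar logit plays the role of the soft prediction $\hat f_\theta$. The two-layer network, the feature indices, and the posterior moments below are illustrative placeholders, not the authors' code.

```python
import torch

def local_gaussian_logit(model, x_revealed_full, unrevealed_idx, mu_pos, Sigma_pos):
    """Approximate the distribution of the scalar logit f_hat(X_U, X_R = x_R) when
    X_U ~ N(mu_pos, Sigma_pos), via a first-order Taylor expansion at mu_pos (Theorem 1):
    mean = f_hat(mu_pos, x_R), variance = g^T Sigma_pos g.
    """
    x = x_revealed_full.clone()
    x[unrevealed_idx] = mu_pos                       # plug the posterior mean into the unrevealed slots
    x.requires_grad_(True)
    logit = model(x.unsqueeze(0)).squeeze()          # scalar soft prediction
    (grad,) = torch.autograd.grad(logit, x)
    g = grad[unrevealed_idx]
    var = g @ Sigma_pos @ g
    return logit.detach(), var.detach()

# Illustrative two-layer ReLU network with a single output logit.
model = torch.nn.Sequential(torch.nn.Linear(4, 10), torch.nn.ReLU(), torch.nn.Linear(10, 1))
x = torch.tensor([0.2, -0.4, 0.0, 0.0])              # entries at the unrevealed indices are placeholders
mu_pos = torch.tensor([0.1, -0.3])                   # posterior moments from Proposition 3
Sigma_pos = torch.tensor([[0.5, 0.1], [0.1, 0.4]])
mean, var = local_gaussian_logit(model, x, [2, 3], mu_pos, Sigma_pos)
# Pr(hard label = 1) then follows from Proposition 5: Phi(mean / sqrt(var)).
p1 = torch.distributions.Normal(0.0, 1.0).cdf(mean / var.clamp_min(1e-12).sqrt())
```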
The performances are displayed for a varying number of sensitive attributes $|S| \in [2, 5]$, while a study for larger $|S|$ is delegated to the Appendix (which excludes the time-consuming optimal baseline, as it is intractable for large $|S|$). For each choice of $|S|$, we randomly select $|S|$ features from the entire feature set and designate them as sensitive attributes. The remaining attributes are considered public. The average accuracy and data leakage are then reported over 100 random sensitive attribute selections. Due to space limits, we present a selection of the results and discuss their trends on the Bank dataset. A comprehensive overview, additional analysis, and experiments are available in Appendix D.

7.1 Linear classifiers

Figure 5: Comparison of the F-Score, Importance, and Random methods for different δ with two baseline methods, Optimal and All features, in terms of accuracy and data leakage. The underlying classifier is a logistic regression classifier.

Figure 5 depicts performance results in accuracy (top subplots, higher is better) and relative data leakage (bottom subplots, lower is better) with a varying number of revealable sensitive features $|S|$. The comparison includes three MinDRel versions: F-Score, which utilizes the scoring function elaborated in Section 4.1 to select the next feature to disclose (left); Importance, which employs a feature importance criterion leveraging the model $f_\theta$ parameters to rank features (middle), detailed in the Appendix; and Random, which arbitrarily selects the next feature to reveal (right). All three versions use the same test procedure to validate whether a set qualifies as a core feature set.

Firstly, notice that MinDRel achieves equal or better accuracy than the optimal mechanism and the baseline that utilizes all features during testing. This suggests that data minimization can be accomplished under linear models without compromising accuracy! Next, observe that an increase in δ aids data minimization, as evidenced by the drop in relative data leakage (note the logarithmic scale of the y-axis). This is attributable to the influence of δ on the test for identifying a core feature set, thereby reducing its size. Finally, the proposed scoring function outperforms the other versions in terms of data leakage minimization, allowing users to disclose substantially fewer sensitive features. The Appendix also includes experiments with larger quantities of sensitive features, presenting analogous trends, where, however, a comparison against the optimal baseline was not possible due to its exponential time complexity.

7.2 Non-linear classifiers

Figure 6: Comparison of the F-Score, Importance, and Random methods for different δ with two baseline methods, Optimal and All features, in terms of accuracy and data leakage. The underlying classifier is a neural network.

To assess MinDRel's performance in reducing data leakage when employing standard nonlinear classifiers, we use a neural network with two hidden layers and ReLU activation functions, and train the models using stochastic gradient descent (see Appendix D for additional details).
The evaluation criteria, baselines, and benchmarks adhere to the same parameters set in Section 5.3. Figure 6 showcases the results in terms of accuracy (top subplots) and data leakage (bottom subplots). Unlike linear classifiers, nonlinear models using MinDRel with a failure probability of δ = 0 cannot guarantee the same level of accuracy as the all-features baseline model. However, this accuracy difference is minimal. Notably, a failure probability of δ = 0 enables users to disclose less than half, and up to 90% fewer, sensitive features across different datasets, while achieving accuracies similar to those of conventional classifiers. Next, similar to what was observed in the previous section, MinDRel with the proposed F-Score selector significantly outperforms the other variants in terms of data leakage minimization. Furthermore, when considering higher failure probabilities, data leakage decreases significantly. For instance, with δ = 0.1, users need to disclose only 5% of their sensitive features while maintaining accuracies comparable to the baseline models (the largest accuracy difference reported was 0.005%). These results are significant: they show that the introduced notion of privacy leakage and the proposed algorithm can become a valuable tool to protect individuals' data privacy at test time, without significantly compromising accuracy.

8 Conclusion and Future Work

This paper introduced the concept of data minimization at test time, whose goal is to minimize the number of features that individuals need to disclose during model inference while maintaining accurate predictions from the model. The motivation for this notion is grounded in the privacy risks imposed by the adoption of learning models in consequential domains, and aligns with the data minimization principle. The paper then discusses an iterative and personalized algorithm that selects the features each individual should release with the goal of minimizing data leakage while retaining exact (in the case of linear classifiers) or high (for non-linear classifiers) accuracy. Experiments over a range of benchmarks and datasets indicate that individuals may be able to release as little as 10% of their information without compromising the accuracy of the model, providing a strong argument for the effectiveness of this approach in protecting privacy while preserving the accuracy of the model. While this study is the first attempt at defining data minimization during inference, it also opens up avenues for further research. First, establishing bounds on the data leakage provided by our proposed method compared to an optimal procedure presents an interesting and open challenge. Second, exploring the relationship between data minimization principles and their consequent disparate impacts presents another open direction. Lastly, developing effective algorithms to provably construct core feature sets for non-linear classifiers is another important area of investigation.

Acknowledgements

This research was partially supported by NSF grant 2133169, a Google Research Scholar Award and an Amazon Research Award.

References

[1] California privacy rights act. https://cppa.ca.gov/. Accessed: 2023-10-23.
[2] General data protection regulation. https://gdpr-info.eu/. Accessed: 2023-10-23.
[3] General personal data protection law.
[4] Personal information protection act. https://elaw.klri.re.kr/eng_service/lawView.do?hseq=53044&lang=ENG. Accessed: 2023-10-23.
[5] Protection of personal information act. https://popia.co.za/. Accessed: 2023-10-23.
[6] M. Abadi et al. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.
[7] K. Amin, T. Dick, A. Kulesza, A. Munoz, and S. Vassilvitskii. Differentially private covariance estimation. Advances in Neural Information Processing Systems, 32, 2019.
[8] C. M. Bishop and N. M. Nasrabadi. Pattern Recognition and Machine Learning, volume 4. Springer, 2006.
[9] C. Blake and C. Merz. UCI repository of machine learning databases, 1988.
[10] G. Chandrashekar and F. Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28, 2014.
[11] C.-H. Chang, M. Mai, and A. Goldenberg. Dynamic measurement scheduling for event forecasting using deep RL. pages 951–960, 2019.
[12] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 2011.
[13] G. Contardo, L. Denoyer, and T. Artières. Sequential cost-sensitive feature acquisition. pages 284–294, 2016.
[14] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
[15] Y. Fu, X. Zhu, and B. Li. A survey on instance selection for active learning. Knowledge and Information Systems, 35(2):249–283, 2013.
[16] A. Krause and C. Guestrin. A note on the budgeted maximization of submodular functions. Citeseer, 2005.
[17] R. Krishnan, P. Esposito, and M. Subedar. Bayesian-Torch: Bayesian neural network layers for uncertainty estimation, Jan. 2022.
[18] R. Krishnan and O. Tickoo. Improving model calibration with accuracy versus uncertainty optimization. In Advances in Neural Information Processing Systems, volume 33, pages 18237–18248, 2020.
[19] Larxel. Children fetal health dataset. https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification, 2021.
[20] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):1–45, 2017.
[21] Y. Li and J. Oliva. Active feature acquisition with generative surrogate models. pages 6450–6459, 2021.
[22] X. Liu, W. Kong, S. Kakade, and S. Oh. Robust and differentially private mean estimation. Advances in Neural Information Processing Systems, 34:3887–3901, 2021.
[23] N. Papernot, P. McDaniel, A. Sinha, and M. Wellman. Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814, 2016.
[24] C. Peterson, F. C. Stingo, and M. Vannucci. Bayesian inference of multiple Gaussian graphical models. Journal of the American Statistical Association, 110(509):159–174, 2015.
[25] S. Roy. Car insurance data. https://www.kaggle.com/datasets/sagnik1511/car-insurance-data, 2021.
[26] B. Settles. Active learning literature survey. 2009.
[27] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
[28] H. Shim, S. J. Hwang, and E. Yang. Why pay more when you can pay less: A joint learning framework for active feature acquisition and classification. arXiv preprint arXiv:1709.05964, 2017.
[29] A. Sudarshan. Customer segmentation data. https://www.kaggle.com/datasets/abisheksudarshan/customer-segmentation, 2021.
[30] B. Ustun, Y. Liu, and D. Parkes. Fairness without harm: Decoupled classifiers with preference guarantees.
In International Conference on Machine Learning, pages 6373–6382. PMLR, 2019.
[31] C. K. Williams and C. E. Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.
[32] L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou. Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739, 2018.
[33] S. Yu, B. Krishnapuram, R. Rosales, and R. B. Rao. Active sensing. pages 639–646, 16–18 Apr 2009.

A Missing proofs

Proposition 1. Given a core feature set $R \subseteq S$ with failure probability $\delta < 0.5$, there exists a monotonically decreasing function $\epsilon : \mathbb{R}^+ \to \mathbb{R}^+$ with $\epsilon(1) = 0$ such that:
$$H\big[f_\theta(X_U, X_R = x_R)\big] \leq \epsilon(1 - \delta),$$
where $H[Z] = -\sum_{z \in [L]} \Pr(Z = z) \log \Pr(Z = z)$ is the entropy of the random variable $Z$.

Proof. In this proof, we demonstrate the binary classification case. The extension to a multi-class scenario can be achieved through a similar process. By the definition of the core feature set, there exists a representative label, denoted as $y^\star \in \{0, 1\}$, such that $\Pr(f_\theta(X_U, X_R = x_R) = y^\star) \geq 1 - \delta$. Without loss of generality, we assume that the representative label is $y^\star = 1$. Therefore, if we denote by $Z$ the probability $\Pr(f_\theta(X_U, X_R = x_R) = 1)$, then $\Pr(f_\theta(X_U, X_R = x_R) = 0) = 1 - Z$. Additionally, we have $Z \geq 1 - \delta > 0.5$, due to the definition of the core feature set and the assumption that $\delta < 0.5$. The entropy of the model's prediction can be represented as:
$$H\big[f_\theta(X_U, X_R = x_R)\big] = -Z \log Z - (1 - Z) \log(1 - Z).$$
Choose $\epsilon(Z) = -Z \log Z - (1 - Z) \log(1 - Z)$. The derivative of $\epsilon(Z)$ is given by $\frac{d\epsilon(Z)}{dZ} = \log\frac{1 - Z}{Z} < 0$, as $Z > 0.5$. As a result, $\epsilon(Z)$ is a monotonically decreasing function on this range, and since $Z \geq 1 - \delta$, it holds that $H[f_\theta(X_U, X_R = x_R)] = \epsilon(Z) \leq \epsilon(1 - \delta)$. When $\delta = 0$, we have $Z = 1$, and by the properties of the entropy, $H[f_\theta(X_U, X_R = x_R)] = 0$.

Proposition 2. Given two subsets $R$ and $R'$ of sensitive features $S$, with $R \subseteq R'$,
$$H\big[f_\theta(X_U, X_R = x_R)\big] \geq H\big[f_\theta(X_{U'}, X_{R'} = x_{R'})\big],$$
where $U = S \setminus R$ and $U' = S \setminus R'$.

Proof. This is due to the property that conditioning reduces uncertainty, or the well-known "information never hurts" theorem in information theory [16].

Proposition 3. The conditional distribution of any subset of unrevealed features $U' \subseteq U$, given the values of released features $X_R = x_R$, is given by:
$$\Pr(X_{U'} \mid X_R = x_R) = \mathcal{N}\Big(\mu_{U'} + \Sigma_{U'R}\Sigma_{RR}^{-1}(x_R - \mu_R),\; \Sigma_{U'U'} - \Sigma_{U'R}\Sigma_{RR}^{-1}\Sigma_{RU'}\Big),$$
where $\Sigma$ is the covariance matrix.

Proof. This is a well-known property of the Gaussian distribution and we refer the reader to Chapter 2.3.2 of the textbook [8] for further details.

Proposition 4. The model prediction before thresholding, $\hat f_\theta(X_U, X_R = x_R) = \theta_U^\top X_U + \theta_R^\top x_R$, is a random variable with a Gaussian distribution $\mathcal{N}(m_f, \sigma_f^2)$, where
$$m_f = \theta_R^\top x_R + \theta_U^\top\big(\mu_U + \Sigma_{UR}\Sigma_{RR}^{-1}(x_R - \mu_R)\big), \qquad (8)$$
$$\sigma_f^2 = \theta_U^\top\big(\Sigma_{UU} - \Sigma_{UR}\Sigma_{RR}^{-1}\Sigma_{RU}\big)\theta_U, \qquad (9)$$
where $\theta_U$ is the sub-vector of parameters $\theta$ corresponding to the unrevealed features $U$.

Proof. The proof of this statement is straightforward due to the property that a linear combination of Gaussian variables $X_U$ is also Gaussian. Additionally, the posterior distribution of $X_U$ is already provided in Proposition 3.

Proposition 5. Let the model prediction prior to thresholding, $\hat f_\theta(X_U, X_R = x_R)$, be a random variable following a Gaussian distribution $\mathcal{N}(m_f, \sigma_f^2)$. Then, the model prediction following thresholding, $f_\theta(X_U, X_R = x_R)$, is a random variable following a Bernoulli distribution $\text{Bern}(p)$ with $p = \Phi(m_f / \sigma_f)$, where $\Phi(\cdot)$ refers to the CDF of the standard normal distribution, and $m_f$ and $\sigma_f$ are given in Equations (5) and (6), respectively.

Proof. In the case of a binary classifier, we have $f_\theta(x) = \mathbb{1}\{\hat f_\theta(x) \geq 0\}$.
If $\hat f$ follows a normal distribution, denoted as $\hat f \sim \mathcal{N}(m_f, \sigma_f^2)$, then by the properties of the normal distribution, $f_\theta$ follows a Bernoulli distribution, denoted as $f_\theta \sim \text{Bern}(p)$, with parameter $p = \Phi(m_f / \sigma_f)$, where $\Phi(\cdot)$ is the cumulative density function of the standard normal distribution.

Proposition 6. Assume $f_\theta$ is a linear classifier. Then, determining if a subset $R$ of sensitive features $S$ is a pure core feature set can be performed in $O(|P| + |S|)$ time.

Proof. As discussed in the main text, to test whether a subset $R$ is a core feature set or not, we need to check if the following two terms have the same sign (either negative or non-negative):
$$\max_{X_U} \theta_U^\top X_U + \theta_R^\top x_R = \|\theta_U\|_1 + \theta_R^\top x_R, \qquad \min_{X_U} \theta_U^\top X_U + \theta_R^\top x_R = -\|\theta_U\|_1 + \theta_R^\top x_R. \qquad (10)$$
These can be computed in time $O(|P| + |S|)$ due to the form of the linear expressions above.

Theorem 1. The distribution of the random variable $\hat f_\theta = \hat f_\theta(X_U, X_R = x_R)$, where $X_U \sim \mathcal{N}(\mu_U^{pos}, \Sigma_U^{pos})$, can be approximated by a Normal distribution as
$$\hat f_\theta \sim \mathcal{N}\Big(\hat f_\theta(X_U = \mu_U^{pos}, X_R = x_R),\; g_U^\top \Sigma_U^{pos}\, g_U\Big), \qquad (11)$$
where $g_U = \nabla_{X_U} \hat f_\theta(X_U = \mu_U^{pos}, X_R = x_R)$ is the gradient of the model prediction at $X_U = \mu_U^{pos}$.

Proof. The proof relies on the first-order Taylor approximation of the classifier $\hat f_\theta$ around its mean:
$$\hat f_\theta(X_U, X_R = x_R) \approx \hat f_\theta(X_U = \mu_U^{pos}, X_R = x_R) + (X_U - \mu_U^{pos})^\top \nabla_{X_U} \hat f_\theta(X_U = \mu_U^{pos}, X_R = x_R). \qquad (12)$$
Since $X_U \sim \mathcal{N}(\mu_U^{pos}, \Sigma_U^{pos})$, we have $X_U - \mu_U^{pos} \sim \mathcal{N}(0, \Sigma_U^{pos})$. By the properties of the normal distribution, the right-hand side of Equation (12) is a linear combination of Gaussian variables, and hence it is also Gaussian.

B Algorithms Pseudocode

The pseudocode for MinDRel for non-linear classifiers is presented in Algorithm 2. There are two main differences between this algorithm and the case of linear classifiers. Firstly, unlike linear classifiers, the procedure for pure core feature set testing on line 5 does not come with an exactness guarantee (see again Section 6.2). The accuracy of the testing procedure depends on the number of random samples that we evaluate. The greater the number of drawn samples, the more likely the testing procedure is to be accurate. During experiments, we draw $10^5$ samples to perform the testing. Additionally, we use Theorem 1 to estimate the distribution of the soft prediction, as seen on line 11, as the exact distribution cannot be computed analytically as in the case of linear classifiers.

Algorithm 2: MinDRel for non-linear classifiers
  input: A test sample $x$; training data $D$
  output: A core feature set $R$ and its representative label $y^\star$
  1   $\mu \leftarrow \tfrac{1}{|D|}\sum_{(x,y)\in D} x$
  2   $\Sigma \leftarrow \tfrac{1}{|D|}\sum_{(x,y)\in D} (x-\mu)(x-\mu)^\top$
  3   Initialize $R = \emptyset$
  4   while True do
  5       if $R$ is a core feature set with repr. label $y^\star$ then
  6           return $(R, y^\star)$
  8       foreach $j \in U$ do
  9           Compute $\Pr(X_j \mid X_R = x_R)$ (using Prop. 3)
  10          $Z \leftarrow \text{sample}(\Pr(X_j \mid X_R = x_R))$, $T$ times
  11          Compute $\Pr(f_\theta(X_j = z, X_{U\setminus\{j\}}, X_R = x_R))$ (using Theorem 1)
  12          Compute $F(X_j)$ (using Eq. (4))
  13      $j^\star \leftarrow \operatorname{argmax}_j F(X_j)$
  14      $R \leftarrow R \cup \{j^\star\}$
  15      $U \leftarrow U \setminus \{j^\star\}$

C Extension from binary to multiclass classification

In the main text, we provide the implementation of MinDRel for binary classification problems. In this section, we extend the method to the multiclass classification problem.

C.1 Estimating $P(f_\theta(X_U, X_R = x_R))$

In order to achieve our goals of determining if a subset is a core feature set for a given $\delta > 0$, and computing the entropy in the scoring function, we need to estimate the distribution of $f_\theta(X_U, X_R = x_R)$. In this section, we first discuss the method of computing the distribution of $\hat f_\theta(X_U, X_R = x_R)$ for both linear and non-linear models. Once this is done, we then address the challenge of estimating the hard label distribution $P(f_\theta(X_U, X_R = x_R))$.

It is important to note that, under the assumption that the input features $X$ are normally distributed with mean $\mu$ and covariance matrix $\Sigma$, the linear classifier $\hat f_\theta(x) = \theta^\top x$ will also follow a multivariate normal distribution.
Specifically, if $X_U \sim \mathcal{N}(\mu_U^{pos}, \Sigma_U^{pos})$, then $\hat f_\theta(X_U, X_R = x_R) \sim \mathcal{N}\big(\theta_R^\top x_R + \theta_U^\top \mu_U^{pos},\; \theta_U^\top \Sigma_U^{pos}\, \theta_U\big)$. For non-linear classifiers, the output $\hat f_\theta(X_U, X_R = x_R)$ does not follow a Gaussian distribution due to the non-linear transformation. To approximate it, we use Theorem 1, which states that the non-linear function $\hat f_\theta(X_U, X_R = x_R)$ can be approximated as a multivariate Gaussian distribution.

Challenges when estimating $P(f_\theta(X_U, X_R = x_R))$. For multi-class classification problems, the hard label $f_\theta(X_U, X_R = x_R)$ is obtained by selecting the class with the highest score, which is given by $\operatorname{argmax}_{i \in [L]} \hat f_\theta^{\,i}(X_U, X_R = x_R)$. However, due to the non-analytical nature of the argmax function, even when $\hat f_\theta(X_U, X_R = x_R)$ follows a Gaussian distribution, the distribution of $f_\theta(X_U, X_R = x_R)$ cannot be computed analytically. To estimate this distribution, we resort to Monte Carlo sampling. Specifically, we draw a number of samples from $P(\hat f_\theta(X_U, X_R = x_R))$, and for each class $y \in \mathcal{Y}$ we approximate the probability $P(f_\theta(X_U, X_R = x_R) = y)$ as the proportion of samples that fall in that class $y$. We provide experiments of MinDRel for multi-class classification cases in Section D.5.

D Experiments details

Datasets information. To show the advantages of the suggested MinDRel technique for safeguarding feature-level privacy, we employ benchmark datasets in our experiments. These datasets include both binary and multi-class classification datasets. The proposed method was evaluated on the following datasets for binary classification tasks:

1. Bank dataset [9]. The objective of this task is to predict whether a customer will subscribe to a term deposit using data from various features, including but not limited to call duration and age. There are a total of 16 features available for this analysis.

2. Adult income dataset [9]. The goal of this task is to predict whether an individual earns more than $50,000 annually. After preprocessing the data, there are a total of 40 features available for analysis, including but not limited to occupation, gender, race, and age.

3. Credit card default dataset [9]. The objective of this task is to predict whether a customer will default on a loan. The data used for this analysis includes 22 different features, such as the customer's age, marital status, and payment history.

4. Car insurance dataset [25]. The task at hand is to predict whether a customer has filed a claim with their car insurance company. The dataset for this analysis is provided by the insurance company and includes 16 features related to the customer, such as their gender, driving experience, age, and credit score.

Furthermore, we also evaluate the proposed method on two additional multi-class classification datasets:

1. Customer segmentation dataset [29]. The task at hand is to classify a customer into one of four distinct categories: A, B, C, and D. The dataset used for this task contains 9 different features, including profession, gender, and working experience, among others.

2. Children fetal health dataset [19]. The task at hand is to classify the health of a fetus into one of three categories: normal, suspect, or pathological, using data from CTG (cardiotocography) recordings.
The data includes approximately 21 different features, such as heart rate and the number of uterine contractions.

Settings. For each dataset, 70% of the data is used for training the classifiers, while the remaining 30% is used for testing. The number of sensitive features, denoted as $|S|$, is chosen randomly from the set of all features. The remaining features are considered public. For each choice of $|S|$, 100 repetitions are performed under different random seeds, and the results are averaged. All methods that require Monte Carlo sampling use 100 random samples. The performance of the different methods is evaluated based on accuracy and data leakage. Two different classifiers are considered.

1. Linear classifiers: We use Logistic Regression as the base classifier.

2. Nonlinear classifiers: The nonlinear classifiers used in this study consist of a neural network with two hidden layers, using the ReLU activation function. The number of nodes in each hidden layer is set to 10. The network is trained using stochastic gradient descent (SGD) with a batch size of 32 and a learning rate of 0.001 for 300 epochs.

For the Bayesian NN, we employ the package bayesian-torch [17] with the default settings. The base regressor is a neural network with one hidden layer that has 10 hidden nodes and a ReLU activation function. We train the network for 300 epochs with a learning rate of 0.001.

Baseline models. We compare our proposed algorithms with the following baseline models:

1. All features: This refers to the usage of the original classifier, which asks users to reveal all sensitive features.

2. Optimal: This method involves evaluating all possible subsets of sensitive features ($2^{|S|}$ in total) in order to identify the minimum pure core feature set. For each subset, the verification algorithm is used to determine whether it is a pure core feature set. The minimum pure core feature set that is found is then selected. It should be noted that, as all possible subsets are evaluated, all sensitive feature values must be revealed. Therefore, this approach is not practical in real-world scenarios. However, it does provide a lower bound on the data leakage of MinDRel (when δ = 0).

MinDRel models. In MinDRel there are two important steps: (1) core feature set verification and (2) selection of the next feature to reveal. As additional baselines, we keep the core feature set verification and vary the selection process. We consider the following three feature selection methods (a sketch of the two baseline selectors follows this list):

1. F-Score: We choose the feature based on the amount of information about the model prediction gained after revealing one feature, as provided in Equation (3).

2. Importance: We reveal the unknown sensitive features in descending order of feature importance until we find a core feature set. The feature importance is determined as follows. We first fit a Logistic Regression $f_\theta(x) = \mathbb{1}\{\theta^\top x \geq 0\}$ on the training dataset $D$ using all features (public included). The importance of a sensitive feature $i \in S$ is determined by $\theta_i^2$.

3. Random: We reveal the unrevealed sensitive features in random order until the revealed set is a core feature set.
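As referenced above, the following is a minimal sketch of the Importance and Random baseline selectors, assuming a scikit-learn LogisticRegression fit on the full training data; the F-Score selector corresponds to the scoring loop sketched in Section 5.3, and all names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_order(X_train, y_train, sensitive_idx):
    """Rank sensitive features by theta_i^2 from a classifier fit on all features."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    theta = clf.coef_.ravel()
    return sorted(sensitive_idx, key=lambda i: theta[i] ** 2, reverse=True)

def random_order(sensitive_idx, seed=0):
    """Reveal sensitive features in a random order."""
    order = list(sensitive_idx)
    np.random.default_rng(seed).shuffle(order)
    return order
```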
[Figure 7 panels (a) Bank dataset, (b) Income dataset, (c) Credit dataset, (d) Insurance dataset; each panel reports Data Leakage versus # Sensitive Features for Optimal, All features, MinDRel (δ = 0), MinDRel (δ = 0.1), and MinDRel (δ = 0.2).]
Figure 7: Comparison between our proposed F-Score (left), Importance (middle), and Random (right) for different choices of the number of sensitive features |S|. The baseline classifier is Logistic Regression.

Metrics. We compare all algorithms in terms of accuracy and data leakage:

1. Accuracy. For algorithms based on the core feature set, such as our MinDRel and Optimal, the representative label is used as the model's prediction. For δ = 0, the representative label can be identified using the pure core feature set verification procedure. For δ > 0, the representative label is given by
$y^* = \arg\max_{y \in \mathcal{Y}} \int P\big(f_\theta(X_U = x_U, X_R = x_R) = y\big)\, P(X_U = x_U \mid X_R = x_R)\, dx_U$.
The accuracy is then determined by comparing this representative label to the ground truth.
2. Data leakage. We compute the percentage of sensitive features that users need to provide on the test set; smaller data leakage is better.

D.1 Additional comparison between using the Gaussian assumption and a Bayesian NN

We first show empirically the benefits of our proposed Gaussian assumption over a Bayesian NN, which allows more flexibility in modeling the conditional distribution $P(X_U \mid X_R = x_R)$. Tables 1 and 2 report the training and inference time of the Bayesian NN and of our Gaussian assumption on various datasets when the number of sensitive features is |S| = 5. When |S| = 5, the number of possible subsets U ⊆ S is $2^5 = 32$, which requires training 32 Bayesian NN models. This is especially slow for datasets with a large number of training samples (e.g., Income, with 50K samples). In contrast, with the Gaussian assumption we only need to precompute 32 inverse matrices $\Sigma^{-1}_{R,R}$, which is fast for data with a small number of features (fewer than 50 in our experiments). Note again that this paper focuses on the case where the number of training samples is much larger than the number of features. Likewise, at inference time the Gaussian assumption allows us to compute the distribution of the model prediction in closed form by simple matrix multiplication, which takes $O(d^2)$. A Bayesian NN instead requires expensive Monte Carlo sampling, especially when |U| is large, to obtain an accurate estimate of $P(X_U \mid X_R = x_R)$.
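As a concrete illustration of why the Gaussian assumption is cheap, the sketch below applies standard Gaussian conditioning to obtain $P(X_U \mid X_R = x_R)$ and the closed-form score distribution of a linear classifier. It is a simplified sketch, not the authors' code; the function names (`fit_joint_gaussian`, `condition_gaussian`, `linear_score_distribution`) are illustrative.

```python
# Illustrative sketch: Gaussian conditioning and the closed-form score distribution
# for a linear classifier f_theta(x) = 1{theta^T x >= 0}. Not the paper's implementation.
import numpy as np

def fit_joint_gaussian(X_train):
    """Estimate the feature mean and covariance from the training data."""
    mu = X_train.mean(axis=0)
    Sigma = np.cov(X_train, rowvar=False)
    return mu, Sigma

def condition_gaussian(mu, Sigma, R, U, x_R):
    """Posterior mean/covariance of X_U given X_R = x_R (standard Gaussian conditioning)."""
    Sigma_RR_inv = np.linalg.inv(Sigma[np.ix_(R, R)])  # can be precomputed per subset R
    mu_post = mu[U] + Sigma[np.ix_(U, R)] @ Sigma_RR_inv @ (x_R - mu[R])
    Sigma_post = Sigma[np.ix_(U, U)] - Sigma[np.ix_(U, R)] @ Sigma_RR_inv @ Sigma[np.ix_(R, U)]
    return mu_post, Sigma_post

def linear_score_distribution(theta, R, U, x_R, mu_post, Sigma_post):
    """Mean and variance of theta_R^T x_R + theta_U^T X_U with X_U ~ N(mu_post, Sigma_post)."""
    mean = theta[R] @ x_R + theta[U] @ mu_post
    var = theta[U] @ Sigma_post @ theta[U]
    return mean, var

# Usage sketch (synthetic data, hypothetical indices):
# rng = np.random.default_rng(0)
# X_train, theta = rng.normal(size=(1000, 5)), rng.normal(size=5)
# mu, Sigma = fit_joint_gaussian(X_train)
# R, U = [0, 2], [1, 3, 4]
# x_R = X_train[0, R]
# mu_post, Sigma_post = condition_gaussian(mu, Sigma, R, U, x_R)
# mean, var = linear_score_distribution(theta, R, U, x_R, mu_post, Sigma_post)
# P(prediction = 1) = 1 - Phi((0 - mean) / sqrt(var)).
```

Because $\Sigma^{-1}_{R,R}$ can be precomputed for each candidate subset R, each prediction reduces to a few matrix-vector products, consistent with the $O(d^2)$ cost discussed above.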
[Figure 8 panels (a) Income dataset and (b) Insurance dataset; each panel reports Data Leakage versus # Sensitive Features for Optimal, All features, MinDRel (δ = 0), MinDRel (δ = 0.1), and MinDRel (δ = 0.2).]
Figure 8: Comparison between our proposed F-Score (left), Importance (middle), and Random (right) for different choices of the number of sensitive features |S|. The baseline classifier is a neural network classifier.

Method                 Bank    Income   Credit   Insurance
Bayesian NN            204     375      125      90
Gaussian assumption    0.01    0.02     0.02     0.01

Table 1: Comparison between a Bayesian neural network and our Gaussian assumption in terms of training time (minutes) when |S| = 5 for various datasets.

We also compare the Gaussian assumption and the Bayesian NN in terms of accuracy and data leakage in Figure 9. We observe no significant difference between the two ways of modeling $P(X_U \mid X_R = x_R)$. Moreover, since the Gaussian assumption significantly reduces training and inference time, all subsequent experiments use MinDRel with the Gaussian assumption and F-Score selection.

D.2 Additional experiments on linear binary classifiers

Additional experiments were conducted to compare the performance of MinDRel to that of the baseline methods using linear classifiers on the Bank, Adult income, Credit, and Insurance datasets, as shown in Figure 7. As in the main text, a consistent trend in performance is observed.

Method                 Bank    Income   Credit   Insurance
Bayesian NN            40      254      220      34
Gaussian assumption    15      78       66       9

Table 2: Comparison between a Bayesian neural network and our Gaussian assumption in terms of inference time (minutes) on the test set when |S| = 5 and δ = 0 for various datasets.

[Figure 9 panels (a) Bank dataset, (b) Income dataset, (c) Credit dataset, (d) Insurance dataset; each panel reports Data Leakage versus # Sensitive Features for MinDRel (δ = 0), MinDRel (δ = 0.1), and MinDRel (δ = 0.2) under the Gaussian assumption and the Bayesian NN.]
Figure 9: Comparison between using a Bayesian NN and our Gaussian assumption in terms of (1) accuracy and (2) data leakage for different choices of the number of sensitive features |S| on different datasets, using a Logistic Regression classifier.

As the number of sensitive attributes |S| increases, the data leakage introduced by MinDRel with the various values of δ grows at a slower rate. Across the different choices of |S|, MinDRel (with δ = 0) requires the revelation of at most 50% of the sensitive information.
To significantly reduce the data leakage of MinDRel, the value of δ can be relaxed. As mentioned in the main text, δ controls the trade-off between accuracy and data leakage: the larger δ is, the greater the uncertainty allowed in the model prediction, which means fewer sensitive features need to be revealed but the prediction accuracy may drop. By choosing an appropriate value for the failure probability, such as δ = 0.1, only a minimal amount of accuracy is sacrificed (at most 0.002%), while the data leakage can be reduced to as low as 5% of the total number of sensitive attributes.

D.3 Additional experiments on non-linear binary classifiers

Additional experiments were conducted to compare the performance of MinDRel to that of the baseline methods using non-linear classifiers on the Bank, Adult income, Credit, and Insurance datasets, as shown in Figure 8. While the All features baseline requires the revelation of all sensitive attributes, MinDRel with different values of δ only requires the revelation of a much smaller number of sensitive attributes. The accuracy difference between the baseline method and MinDRel is also minimal (at most 2%). These results demonstrate the effectiveness of MinDRel in protecting privacy while maintaining good prediction performance on test data.

D.4 Scalability of MinDRel for large |S|

We demonstrate the performance of MinDRel for a large number of sensitive features |S|. Note that, to reduce the runtime, we did not run the Optimal method, which performs an exponential search over all possible subsets of S. We first report the accuracy and data leakage of MinDRel when using F-Score or either of the two heuristic rules, Importance and Random, for logistic regression classifiers in Figure 10.

[Figure 10 panels (a) Bank dataset, (b) Income dataset, (c) Credit dataset, (d) Insurance dataset; each panel reports Data Leakage versus # Sensitive Features for All features, MinDRel (δ = 0), MinDRel (δ = 0.1), and MinDRel (δ = 0.2).]
Figure 10: Comparison between our proposed F-Score (left), Importance (middle), and Random (right) for different choices of the number of sensitive features |S|. The baseline classifier is a logistic regression classifier.

[Figure 11: four panels reporting the average prediction time (seconds) versus # Sensitive Features for the Random, Importance, and F-Score methods.]
Figure 11: Comparison in terms of average prediction time (seconds) among the F-Score, Importance, and Random methods in MinDRel (δ = 0) for different |S|.

Finally, we report in Figure 11 the average testing time (in seconds) needed to obtain the model prediction per user with MinDRel. Note that, in this case, we assume the time taken by users to release sensitive features is negligible.
It is evident that even when |S| > 15, our proposed MinDRel with F-Score takes only slightly more than 1 second to produce the model prediction per user. This demonstrates the applicability of the models in practice.

D.5 Evaluation of MinDRel on multi-class classifiers

Linear classifiers. We also provide a comparison of accuracy and data leakage between our proposed MinDRel and the baseline models for linear classifiers. These metrics are reported for the Customer and Children Fetal Health datasets in Figures 12a and 12b, respectively. The figures clearly show the benefits of MinDRel in reducing data leakage while maintaining an accuracy comparable to the baseline models.

[Figure 12 panels (a) Customer dataset and (b) Children fetal health dataset; each panel reports Data Leakage for All features, Optimal, MinDRel (δ = 0), MinDRel (δ = 0.1), and MinDRel (δ = 0.2).]
Figure 12: Comparison between our proposed F-Score (left), Importance (middle), and Random (right) for different choices of the number of sensitive features |S|. The baseline classifier is a multinomial Logistic Regression.

[Figure 13 panels (a) Customer dataset and (b) Children fetal health dataset; each panel reports Data Leakage versus # Sensitive Features for All features, Optimal, MinDRel (δ = 0), MinDRel (δ = 0.1), and MinDRel (δ = 0.2).]
Figure 13: Comparison between our proposed F-Score (left), Importance (middle), and Random (right) for different choices of the number of sensitive features |S|. The baseline classifier is a neural network classifier.

Nonlinear classifiers. Similarly, we present a comparison of our proposed algorithms with the baseline methods when using non-linear classifiers. These metrics are reported for the Customer and Children Fetal Health datasets in Figures 13a and 13b, respectively. The results show that MinDRel with δ = 0 incurs only a minimal decrease in accuracy while significantly reducing the amount of data leakage compared to the baseline method.

E Extended Motivation

Finally, we delve deeper into the motivation underlying our research. Our study primarily concerns tabular datasets in which the total number of features is much smaller than the number of training samples. We concentrate on applications that require user-provided personal information for decision-making, such as lending, online insurance services, and healthcare services. The prevailing weaknesses of these applications, which our work attempts to address, are outlined as follows:

1. Privacy Concerns: The primary issue stems from the need for users to disclose sensitive information. For instance, in the context of online healthcare services, patients are required to share an array of sensitive health data (weight, height, smoking habits, etc.) via a website or mobile application. This exposes users to potential privacy threats.
2. User and Organizational Expenditure of Time and Effort: Numerous applications, such as lending, involve the time-consuming and effort-intensive process of gathering sensitive information and its supporting evidence. Users, for example, are required to validate their income through payslips or employment contracts in lending applications.
Similarly, the organization must invest time and resources to verify the authenticity of submitted documents.
3. Legal Constraints: Under the EU General Data Protection Regulation's principle of data minimization, the collection of excessive personal information by companies and organizations is restricted. Our main text illustrates that it is not necessary to report all features to preserve the model's accuracy.

Inference Time. We also emphasize the importance of low inference time from both the user and the business perspective. From the user's viewpoint, applications such as online car insurance require answering a series of questions to determine the insurance plan; naturally, users prefer answering fewer questions in less time. From a business standpoint, prolonged inference time may lead to customer dissatisfaction, potentially resulting in contract termination. This motivates our algorithmic choices over more complex conditional distribution modeling methods, which can significantly increase inference time.