# ENSUR: Equitable and Statistically Unbiased Recommendation

Nitin Bisht 1, Xiuwen Gong 1, Guandong Xu 2

1 University of Technology, Sydney. 2 The Education University of Hong Kong. Correspondence to: Xiuwen Gong, Guandong Xu.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Although Recommender Systems (RS) have been well developed for various fields of application, they often suffer from a crisis of platform credibility with respect to RS confidence and fairness, which may drive users away and threaten the platform's long-term success. In recent years, some works have tried to solve these issues; however, they lack strong statistical guarantees. Therefore, there is an urgent need to solve both issues within a unifying framework with robust statistical guarantees. In this paper, we propose a novel and reliable framework called Equitable and Statistically Unbiased Recommendation (ENSUR) to dynamically generate prediction sets for users across various groups, which are guaranteed 1) to include ground-truth items with a user-predefined high confidence/probability (e.g., 90%); 2) to ensure user fairness across different groups; and 3) to have minimum average prediction set sizes. We further design an efficient algorithm named Guaranteed User Fairness Algorithm (GUFA) to optimize the proposed method, and derive upper bounds of the risk and fairness metrics to speed up the optimization process. Moreover, we provide rigorous theoretical analysis concerning risk and fairness control and minimum set size. Extensive experiments validate the effectiveness of the proposed framework, which aligns with our theoretical analysis.

1. Introduction

Recommender Systems (RS) (Aggarwal, 2016; Fan et al., 2022; Sharma et al., 2024) are a type of information filtering system designed to provide suggestions to users based on their preferences. While much effort goes into improving the accuracy of these recommendation models, less attention has been paid to model confidence, which affects users' trust in the platform's credibility. In recent years, a few recommendation approaches (Naghiaei et al., 2022; Kweon et al., 2024) have been developed for model confidence. However, these methods rely on heuristic modeling without statistical guarantees. Meanwhile, fairness is another critical issue that may harm user experience and undermine platform reliability. Some fairness-based recommendation models have been developed in recent years (Li et al., 2023; Han et al., 2024). While these papers alleviate fairness issues in recommendation systems, they are typically validated only empirically, without statistical guarantees for either performance or fairness. As a result, we are motivated to develop a complete and statistically guaranteed recommendation framework that considers model confidence and fairness jointly.

Our overall goal is to construct set predictors that generate a minimum prediction set for each user while guaranteeing model confidence and ensuring user fairness among different groups. Thus, the objectives of our framework are threefold: (1) to construct prediction sets that cover the true item with a high user-predefined probability, say 90% (i.e., the confidence level); (2) to guarantee user fairness across different groups; and (3) to guarantee minimum average set size while ensuring (1) and (2). Inspired by Risk-Controlling Prediction Sets (RCPS) (Bates et al., 2021b), a powerful statistical tool, we propose a reliable and fair framework called Equitable and Statistically Unbiased Recommendation (ENSUR) to achieve the above-mentioned objectives. However, RCPS in its original form is designed only to ensure coverage guarantees.
As a result, it does not address our key objectives, specifically: 1) how to guarantee fairness among different user group definitions in a statistical way; 2) how to improve the efficiency of constructing prediction sets when the search range is large; 3) how to produce recommendation sets with minimum size; and 4) how to theoretically guarantee that the constructed prediction sets meet the risk control and fairness definitions as well as the minimum set size. To address these gaps, we first define an estimator called the fairness metric, which is required to meet the Fairness-Controlling Prediction Sets (FCPS) condition, defined analogously to risk control. We then build our objective function by minimizing the average prediction set size subject to both the RCPS and FCPS constraints for all users across different groups. Subsequently, we derive upper bounds for both the risk and the fairness metric to accelerate the optimization process for prediction set construction. Lastly, we provide theoretical analysis to prove the effectiveness of the set predictors with respect to RCPS, FCPS, and minimum set size. The proposed framework is depicted in Figure 1.

Figure 1. The proposed ENSUR framework. Red check marks indicate the true relevant items.

Our contributions are summarized as follows: Firstly, we formulate the recommendation problem from a statistically guaranteed perspective in terms of a risk control guarantee and a fairness control guarantee, and propose a reliable and fair recommendation framework, i.e., Equitable and Statistically Unbiased Recommendation (ENSUR), which is able to construct minimum prediction sets while ensuring the risk control and fairness guarantees for all users in different groups. Secondly, we design an efficient optimization algorithm, i.e., Greedy User Fairness Algorithm (GUFA), to optimize the objective function of ENSUR.
To accelerate the optimization process, we derive upper bounds for both the expected risk and the fairness metric via concentration inequalities in Theorem 4.1 and Theorem 4.2, and then make them approach their respective thresholds in a greedy way. Next, we establish rigorous theoretical guarantees for the proposed framework ENSUR. We prove that the constructed prediction sets achieve risk control and fairness guarantees in Theorem 5.1 while achieving minimal set sizes in Theorem 5.2, which theoretically verifies the effectiveness of ENSUR. Finally, we conduct comprehensive experiments on top of five commonly used recommendation models and various datasets across multiple domains and fairness definitions, demonstrating the empirical efficiency and effectiveness of the proposed ENSUR, which also aligns with our theoretical analysis.

The code and implementation details are available at https://github.com/kalpiree/ENSUR

2. Related Works

2.1. Recommendation

Recommender systems (RS) (Ko et al., 2022; Lu et al., 2015) help users make decisions via personalized content in different fields of application, such as e-commerce (Schafer et al., 1999), media streaming (Chang et al., 2017), and social networks (He et al., 2024). Credibility and fairness are two crucial factors in ensuring the satisfaction of customers and the long-term success of these systems. Traditional recommendation models primarily focused on accuracy (Adomavicius & Tuzhilin, 2005; Ricci et al., 2010). In line with broader trends in machine learning (Huang et al., 2021; Liu et al., 2019; Zou & Liu, 2023), however, there is a growing appreciation that model confidence, i.e., the reliability of a recommendation, is equally important. Yet most of these methods rely on heuristic modeling without statistical guarantees (Naghiaei et al., 2022).
Meanwhile, some fairness-based recommendation models have been developed in recent years, which usually focus on a particular fairness issue in specific fields of application. Fairness in RS can be viewed from diverse perspectives (Li et al., 2023). One such perspective distinguishes individual fairness from group fairness. Individual fairness requires that similar individuals receive comparable treatment. However, defining this similarity is challenging due to disagreements over task-specific similarity metrics (Dwork et al., 2011). Group fairness, on the other hand, ensures that protected groups receive treatment comparable to that of advantaged groups or the general population (Pedreschi et al., 2009), thus ensuring equitable treatment across predefined groups. It can be further classified into user-side and item/platform-side fairness. User-side group fairness can be defined based on sensitive features such as age, gender, and race. Yao & Huang (2017) used gender to distinguish between advantaged and disadvantaged user groups and measured prediction discrepancies. Another approach differentiates groups based on user interactions, as defined by Li et al. (2021a) and Abdollahpouri et al. (2019). To ensure fairness, existing works apply several techniques such as regularization and constrained optimization (Li et al., 2021a; Islam et al., 2021). Other approaches use Reinforcement Learning by formulating the problem as a Constrained Markov Decision Process (Ge et al., 2021; 2022). To evaluate fairness, Yao & Huang (2017) introduced four group metrics for collaborative filtering recommender models. Fu et al. (2020) employed the Group Recommendation Unfairness (GRU) metric to assess disparities across these user groups. Rahmani et al. (2022) showed that this approach balances fairness with utility under certain conditions.
Unlike the more robust statistical frameworks utilized in general machine learning works (Gong et al., 2023b;a; 2021), these fairness methods do not have a notion of statistical guarantees. Addressing that gap is the focus of our work.

2.2. Risk-Controlling Prediction Sets

We develop uncertainty quantification for model confidence and fairness based on Risk-Controlling Prediction Sets (RCPS) (Bates et al., 2021b). RCPS is a general framework, not a specific algorithm, for producing predictive sets that satisfy the risk control in Definition 1. Different contexts require different designs of the risk or other estimators to achieve the best performance. For example, in the context of medical diagnosis, if the set $S(X)$ represents plausible diagnoses based on patient features $X$ and $R(S)$ is the expected loss from missing true diagnoses, then RCPS ensures that this risk remains below $\alpha$ with confidence $1 - \delta$. This enables doctors to automatically screen for many diseases (e.g., via a blood sample) and refer the patient to relevant specialists. We will apply the RCPS framework to the risk and fairness metric designed for the recommendation context.

Definition 1 (Risk-controlling prediction sets (RCPS) (Bates et al., 2021b)). Let $S$ be a random function taking values in the space of functions $X \to Y$ (e.g., a functional estimator trained on data). We say that $S$ is an $(\alpha, \delta)$-RCPS if, with probability at least $1 - \delta$, we have $R(S) \le \alpha$.

3. The Proposed Framework

In this section, we formulate the objective functions that our framework, i.e., Equitable and Statistically Unbiased Recommendation (ENSUR), aims to achieve. Firstly, we introduce the notation used in the paper. Consider $n$ items, denoted as $i = [i_j]_{j=1}^{n}$, where each item $i_j$ is an element of the item space $I$. Similarly, we have $m$ users, represented by $u = [u_k]_{k=1}^{m}$, where each user $u_k$ belongs to the user space $U$. For brevity, we use $u$ and $i$ for user and item, respectively.
The group information $G$ of each user $u$ is known, and following (Li et al., 2021b), we partition users into two groups, $G_1$ and $G_2$, such that $G_1 \cap G_2 = \emptyset$ and $G_1 \cup G_2 = U$ to ensure exclusivity. Here, $G_1$ and $G_2$ represent the advantaged and disadvantaged groups, respectively. The recommendation is conducted via a relevance model $m : U \times I \to [0, 1]$, which maps a user $u$ and an item $i$ to an estimated score $m(u, i)$; items with the highest scores are usually the most relevant recommendations. However, there is no theoretical guarantee on the confidence of the model's output, so the reliability of the recommended items remains uncertain. In the following, we follow the framework of Risk-Controlling Prediction Sets (RCPS) (Bates et al., 2021a) to close this gap.

We define our set predictor to be $\phi : u \to i$, where $i \subseteq I$ is a set-valued output guided by the parameter $\lambda$. This $\lambda$ takes values in a closed set $\Lambda \subset \mathbb{R}$ such that $\phi(\cdot)$ is nested, i.e.,

$$\lambda_1 < \lambda_2 \implies \phi_{\lambda_2}(u) \subseteq \phi_{\lambda_1}(u). \tag{1}$$

Considering the recommendation setting with implicit feedback (Hu et al., 2008; Zhu et al., 2024), we define the loss function between the relevant item $i_{true}$ of user $u$ and the prediction set $\phi_\lambda(u)$ to be the 0-1 loss:

$$L(i_{true}, \phi_\lambda(u)) = \begin{cases} 1 & \text{if } i_{true} \notin \phi_\lambda(u), \\ 0 & \text{if } i_{true} \in \phi_\lambda(u). \end{cases} \tag{2}$$

Following (Bates et al., 2021b), the loss function $L(i_{true}, \phi_\lambda(u))$ is also assumed to satisfy the following monotonicity property:

$$\phi_{\lambda_1}(u) \subseteq \phi_{\lambda_2}(u) \implies L(i_{true}, \phi_{\lambda_1}(u)) \ge L(i_{true}, \phi_{\lambda_2}(u)). \tag{3}$$

Based on the above loss function, we define the expected risk of not including a ground-truth item in the prediction set for all users as follows:

$$R(\lambda_G) = \mathbb{E}\big(L(i_{true}, \phi_{\lambda_G}(u))\big). \tag{4}$$

Subsequently, we require the defined risk to meet the risk-controlling prediction sets (RCPS) condition, which ensures that the probability of the risk being no greater than the user-specified threshold $\alpha$ is at least the user-defined confidence level $1 - \delta$, namely, the reliability of the recommendation. This can be formulated as follows:

$$\Pr\big(R(\lambda_G) \le \alpha\big) \ge 1 - \delta. \tag{5}$$

Meanwhile, fairness between users in the advantaged and disadvantaged groups is another challenge that needs to be tackled. Notably, in recommendation settings, being advantaged or disadvantaged can stem from various factors such as demographics, engagement patterns, or other domain-specific attributes. Thus, we define a fairness metric $\Delta F(\cdot)$ via the difference between a commonly used recommendation metric (such as hit rate (HR) or DCG) of the advantaged group $G_1$ and that of the disadvantaged group $G_2$, to evaluate user fairness as follows:

$$\Delta F(\lambda_{G_1}, \lambda_{G_2}) := \left| \sum_{u \in G_1} M\big(\phi_{\lambda_{G_1}}(u)\big) - \sum_{u \in G_2} M\big(\phi_{\lambda_{G_2}}(u)\big) \right|. \tag{6}$$

Here, $M(\cdot)$ denotes a generalized function representing a recommendation metric (such as HR or DCG) that measures the performance of the recommendation set $\phi_{\lambda_G}(u)$ for any user $u$. For example, when we use hit rate (HR) or DCG as the recommendation metric, we can express them as:

$$\mathrm{HR}(G_i) = \frac{1}{|G_i|} \sum_{u \in G_i} \mathbb{I}\big(\text{relevant item in } \phi_{\lambda_{G_i}}(u)\big), \qquad \mathrm{DCG}(G_i) = \frac{1}{|G_i|} \sum_{u \in G_i} \mathrm{DCG}\big(\phi_{\lambda_{G_i}}(u)\big).$$

Thus, the fairness metrics can be expressed as:

$$\Delta \mathrm{HR} = |\mathrm{HR}(G_1) - \mathrm{HR}(G_2)|, \qquad \Delta \mathrm{DCG} = |\mathrm{DCG}(G_1) - \mathrm{DCG}(G_2)|.$$

This design makes our proposed framework more flexible by accommodating different types of RS metrics and diverse user-group definitions. Similarly, we require the defined fairness metric to meet the fairness-controlling prediction sets (FCPS) condition, that is, the probability of the fairness metric being no greater than a user-specified threshold $\eta$ is at least the user-predefined confidence level $1 - \hat{\delta}$, namely, the reliability of fairness. The detailed formulation can be expressed as follows:

$$\Pr\big(\Delta F(\lambda_{G_1}, \lambda_{G_2}) \le \eta\big) \ge 1 - \hat{\delta}. \tag{7}$$

Moreover, we want the constructed prediction sets to be as small as possible while meeting both the risk-controlling guarantee and the fairness-controlling guarantee.
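To make the definitions above concrete, the following minimal Python sketch computes nested prediction sets, the 0-1 loss of Equation (2), the empirical group risk, and the hit-rate gap between two groups. The function names, the group split, and the toy scores are hypothetical and chosen only for illustration; they are not taken from the ENSUR code base.

```python
from typing import Dict, List, Set

# Hypothetical toy data: user -> {item: relevance score m(u, i) in [0, 1]}.
SCORES: Dict[str, Dict[str, float]] = {
    "u1": {"a": 0.9, "b": 0.4, "c": 0.1},
    "u2": {"a": 0.3, "b": 0.8, "c": 0.6},
    "u3": {"a": 0.2, "b": 0.5, "c": 0.7},
    "u4": {"a": 0.6, "b": 0.1, "c": 0.3},
}
TRUE_ITEM: Dict[str, str] = {"u1": "a", "u2": "c", "u3": "b", "u4": "c"}
G1: List[str] = ["u1", "u2"]  # illustrative "advantaged" group
G2: List[str] = ["u3", "u4"]  # illustrative "disadvantaged" group


def prediction_set(user: str, lam: float) -> Set[str]:
    """Items whose score reaches lam; a larger lam gives a smaller, nested set."""
    return {item for item, score in SCORES[user].items() if score >= lam}


def zero_one_loss(user: str, lam: float) -> int:
    """1 if the true item is missing from the prediction set, else 0."""
    return 0 if TRUE_ITEM[user] in prediction_set(user, lam) else 1


def empirical_risk(group: List[str], lam: float) -> float:
    """Average 0-1 loss over the users of one group."""
    return sum(zero_one_loss(u, lam) for u in group) / len(group)


def hit_rate(group: List[str], lam: float) -> float:
    """Fraction of users whose true item is covered (one minus the risk)."""
    return 1.0 - empirical_risk(group, lam)


def hit_rate_gap(lam_g1: float, lam_g2: float) -> float:
    """Absolute hit-rate difference between the two groups."""
    return abs(hit_rate(G1, lam_g1) - hit_rate(G2, lam_g2))


def average_set_size(group: List[str], lam: float) -> float:
    """Mean number of recommended items per user in the group."""
    return sum(len(prediction_set(u, lam)) for u in group) / len(group)


if __name__ == "__main__":
    lam = 0.5
    print("risk G1:", empirical_risk(G1, lam))
    print("risk G2:", empirical_risk(G2, lam))
    print("hit-rate gap:", hit_rate_gap(lam, lam))
    print("avg set size G1:", average_set_size(G1, lam))
```

Lowering the threshold for one group enlarges its sets, which can only reduce that group's empirical risk; this is the lever the framework tunes per group.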
A smaller but more relevant set not only reduces uncertainty in recommendations (Coscrato & Bridge, 2023) but also enhances user satisfaction and eases cognitive load (Chen et al., 2022), ultimately improving the usability and effectiveness of the RS. Therefore, our goal is to find the optimal $(\lambda_{G_1}, \lambda_{G_2})$ that minimizes the average size of the recommendation sets while satisfying the risk (coverage) and fairness guarantees for all users in groups $G_1$ and $G_2$. The objective function can be formulated as follows:

$$\arg\min_{(\lambda_{G_1}, \lambda_{G_2})} \sum_{u \in G} |\phi_{\lambda_G}(u)| \quad \text{s.t.} \quad \Pr\big(R(\lambda_G) \le \alpha\big) \ge 1 - \delta \ \text{ for all } G \in \{G_1, G_2\}, \quad \Pr\big(\Delta F(\lambda_{G_1}, \lambda_{G_2}) \le \eta\big) \ge 1 - \hat{\delta}. \tag{8}$$

Here, $\alpha$ and $\eta$ are the user pre-specified risk and fairness thresholds, say 10%; $1 - \delta$ and $1 - \hat{\delta}$ are the user-predefined confidence levels for the risk and fairness, say 90%.

4. The Optimization Algorithm

To optimize the objective function in Equation (8), we need to ensure that the risk and fairness metric in the constraints are below the decision-maker's predefined values $\alpha$ and $\eta$, respectively, and finally obtain the optimal prediction set with minimum size. Directly applying a greedy algorithm is inefficient, because the range of risk and fairness values that must be explored while adjusting $(\lambda_{G_1}, \lambda_{G_2})$ toward the thresholds $\alpha$ and $\eta$ is very large. If we can derive upper bounds of both the risk and the fairness metric and evaluate them at their corresponding upper bounds $R^+_G(\lambda_G, \delta)$ and $\Delta F^+(\lambda_{G_1}, \lambda_{G_2}, \hat{\delta})$, respectively, then approaching the thresholds $\alpha$ and $\eta$ by adjusting $(\lambda_{G_1}, \lambda_{G_2})$ becomes more efficient. Following the upper bound strategy used to accelerate optimization in (Bates et al., 2021b), we obtain the optimized risk constraint:

$$\Pr\big(R(\lambda_G) \le R^+_G(\lambda_G, \delta)\big) \ge 1 - \delta \quad \text{and} \quad R^+_G(\lambda_G, \delta) \le \alpha, \quad \text{for all } G \in \{G_1, G_2\}. \tag{9}$$

Similarly, the optimized fairness metric constraint can be reformulated as follows:

$$\Pr\big(\Delta F(\lambda_{G_1}, \lambda_{G_2}) \le \Delta F^+(\lambda_{G_1}, \lambda_{G_2}, \hat{\delta})\big) \ge 1 - \hat{\delta} \quad \text{and} \quad \Delta F^+(\lambda_{G_1}, \lambda_{G_2}, \hat{\delta}) \le \eta. \tag{10}$$

Consequently, we can choose $\hat{\lambda}$ as the largest value of $\lambda$ such that the entire confidence region to the left of $\lambda$ falls below the target levels $\alpha$ and $\eta$, so that the set size achieves its minimum value. The optimized objective function can be formulated as follows:

$$(\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2}) = \sup\big\{ \lambda_{G_1}, \lambda_{G_2} \in [0, 1] : R^+_G(\lambda_G, \delta) \le \alpha, \ \Delta F^+(\lambda_{G_1}, \lambda_{G_2}, \hat{\delta}) \le \eta \big\}. \tag{11}$$

To optimize the above objective function and output the optimal solution $(\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2})$, which determines the validity of the set predictor, we design a novel greedy-strategy-based algorithm called Greedy User Fairness Algorithm (GUFA). The complete procedure of the optimization algorithm is summarized in Algorithm 1. However, the form of the upper bounds of the risk and fairness metric remains to be specified. In the following, we derive these upper bounds in Theorem 4.1 and Theorem 4.2, respectively.

Theorem 4.1 (Upper Bound for Risk). Assume that the loss function $L(i_{true}, \phi_{\lambda_G}(u))$ follows a Bernoulli distribution. Then the upper bound for the risk $R(\lambda_G)$ can be found as follows:

$$R^+_G(\lambda_G, \delta) = \sup\big\{ \hat{R}(\lambda_G) : \mathrm{BinomCDF}\big(n \hat{R}(\lambda_G), n, \alpha\big) \ge \delta \big\}, \tag{12}$$

where $n$ is the number of samples; $G \in \{G_1, G_2\}$; and $\hat{R}(\lambda_G)$ denotes the empirical risk of $R(\lambda_G)$, which can be calculated as follows:

$$\hat{R}(\lambda_G) = \frac{1}{|G|} \sum_{u \in G} L\big(i_{true}, \phi_{\lambda_G}(u)\big). \tag{13}$$

Here, $|G|$ denotes the number of users in group $G$.

Proof. The proof can be found in Appendix B.1.
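As an illustration of how a binomial upper confidence bound of this kind can be evaluated numerically, the sketch below follows the form given in Bates et al. (2021b): it searches for the largest candidate risk value `r` such that the binomial CDF at the observed number of misses is still at least `delta`. Reading Equation (12) this way is an assumption, and the helper names `binom_cdf` and `risk_upper_bound` are hypothetical. Only the Python standard library is used.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(
        math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k + 1)
    )


def risk_upper_bound(num_misses: int, n: int, delta: float,
                     tol: float = 1e-9) -> float:
    """Largest r with binom_cdf(num_misses, n, r) >= delta, up to tol.

    num_misses is n times the empirical risk (number of users whose true
    item is not in their prediction set). The CDF is decreasing in r, so
    the boundary is located by bisection.
    """
    if num_misses >= n:
        return 1.0  # every user was missed; the bound is trivial
    lo, hi = 0.0, 1.0  # cdf at r=0 is 1 (>= delta); at r=1 it is 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binom_cdf(num_misses, n, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return hi  # slightly conservative side of the boundary


if __name__ == "__main__":
    # Example: 5 misses among 100 calibration users, delta = 0.1.
    print(round(risk_upper_bound(5, 100, 0.1), 4))
```

The direct summation is meant for small calibration sets; for large $n$, a library routine such as `scipy.stats.binom.cdf` would likely be a more robust choice.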
Algorithm 1 Guaranteed User Fairness Algorithm (GUFA)
 1: Initialization:
 2:   Initialize control parameters for two groups $\lambda_{G_1}, \lambda_{G_2}$
 3:   Initialize user pre-specified parameters $\alpha, \eta, \delta, \hat{\delta}, \Delta_1, \Delta_2$
 4:   Define Loss as in Equation (2)
 5:   Define Fairness metric as in Equation (6)
 6: Adjustment Loop:
 7: for users in each group $G \in \{G_1, G_2\}$ do
 8:   Calculate $R^+_G(\lambda_G, \delta)$ such that $\Pr(R(\lambda_G) \le R^+_G(\lambda_G, \delta)) \ge 1 - \delta$
 9:   Compute $\Delta F(\lambda_{G_1}, \lambda_{G_2})$ and calculate $\Delta F^+(\lambda_{G_1}, \lambda_{G_2}, \hat{\delta})$ such that $\Pr(\Delta F(\lambda_{G_1}, \lambda_{G_2}) \le \Delta F^+(\lambda_{G_1}, \lambda_{G_2}, \hat{\delta})) \ge 1 - \hat{\delta}$
10:   if $R^+_G(\lambda_G, \delta) > \alpha$ OR $\Delta F^+(\lambda_{G_1}, \lambda_{G_2}, \hat{\delta}) > \eta$ then
11:     Update $\lambda_{G_1} \leftarrow \lambda_{G_1} - \Delta_1$, $\lambda_{G_2} \leftarrow \lambda_{G_2} - \Delta_2$
12:   end if
13: end for
14: $\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2} \leftarrow \lambda_{G_1}, \lambda_{G_2}$ (get the optimal $\lambda_{G_1}, \lambda_{G_2}$)
15: Construct Prediction Sets:
16: for each user $u$ in group $G$ do
17:   $\phi_{\hat{\lambda}_G}(u) \leftarrow \{ i \mid m(u, i) \ge \hat{\lambda}_G \}$
18: end for
19: Output: the optimal solution $\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2}$ and prediction sets $\phi_{\hat{\lambda}_{G_1}}(u)$ and $\phi_{\hat{\lambda}_{G_2}}(u)$ for all users in different groups.

Theorem 4.2 (Upper Bound for Fairness Metric). The upper bound for the fairness metric $\Delta F(\lambda_{G_1}, \lambda_{G_2})$ can be derived by applying the Bernstein inequality (Maurer & Pontil, 2009) as follows:

$$\Delta F^+(\lambda_{G_1}, \lambda_{G_2}, \hat{\delta}) = \Delta F(\lambda_{G_1}, \lambda_{G_2}) + \sqrt{\frac{2 \sigma_F^2 \log\frac{2}{\hat{\delta}}}{n_1 + n_2}}, \tag{14}$$

where $n_1$ and $n_2$ denote the number of samples for groups $G_1$ and $G_2$, and $\sigma_F^2$ denotes the variance associated with the fairness metric (HR or DCG). The detailed formulation of the variance is given in Appendix A.1.

Proof. The proof can be found in Appendix B.2.

Recommendation. After obtaining the optimal $(\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2})$ from Algorithm 1, we can recommend new items for users. For example, when a new user $u$ arrives, we first determine the group $G$ to which they belong and then use the corresponding $\hat{\lambda}_G$ to compute their prediction set via step 17 of Algorithm 1.

5. Theoretical Analysis

In this section, we provide theoretical analysis of the risk and fairness control guarantee in Theorem 5.1, as well as the minimum set size guarantee in Theorem 5.2.

Theorem 5.1 (Risk and Fairness Control Guarantee). For every group $G \in \{G_1, G_2\}$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$ for the risk threshold $\alpha$, and with probability at least $1 - \hat{\delta}$ for the fairness threshold $\eta$, we have:

$$\Pr\big(R(\hat{\lambda}_G) \le \alpha\big) \ge 1 - \delta, \qquad \Pr\big(\Delta F(\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2}) \le \eta\big) \ge 1 - \hat{\delta}. \tag{15}$$

Proof. The proof can be found in Appendix B.3.

Remark. In Theorem 5.1, we prove that the optimal $\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2}$ obtained from Algorithm 1 are indeed able to keep the expected risk below the decision-maker's chosen value $\alpha$ with confidence $1 - \delta$, and to keep the fairness metric $\Delta F$ below the decision-maker's chosen value $\eta$ with confidence $1 - \hat{\delta}$. This theoretically validates the recommendation reliability and fairness of the proposed ENSUR framework.

Theorem 5.2 (Minimum Set Size Guarantee). Let $(\phi_{\lambda'_{G_1}}, \phi_{\lambda'_{G_2}})$ be any set predictor and let $(\phi_{\hat{\lambda}_{G_1}}, \phi_{\hat{\lambda}_{G_2}})$ be the optimal predictor obtained from Algorithm 1 such that $R(\lambda'_G) \le R(\hat{\lambda}_G)$ and $\Delta F(\lambda'_{G_1}, \lambda'_{G_2}) \le \Delta F(\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2})$. Then for each $G \in \{G_1, G_2\}$, we have:

$$\mathbb{E}\big[ |\phi_{\hat{\lambda}_G}(u)| \big] \le \mathbb{E}\big[ |\phi_{\lambda'_G}(u)| \big], \tag{16}$$

where $|\phi_{\hat{\lambda}_G}(u)|$ denotes the predicted set size for any user $u$ in group $G$.

Proof. The proof can be found in Appendix B.4.

Remark. In Theorem 5.2, we prove that the set predictor learned by our algorithm outputs the minimal prediction set size for any user $u$ in group $G$, which theoretically validates the effectiveness of the proposed ENSUR framework.

To sum up, the set predictors constructed by Algorithm 1 can modify any black-box recommendation model to output prediction sets for new customers that are guaranteed to satisfy the risk control defined in Equation (5) and the fairness control defined in Equation (7) while ensuring the minimum prediction sets of Equation (8).

6. Experiments

In this section, we conduct experiments to validate the effectiveness of the proposed framework (ENSUR).
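Before the experimental details, the calibration procedure of Algorithm 1, combined with the bounds of Theorems 4.1 and 4.2, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: hit rate is used as the metric, each user is summarized by the score of their true item, the risk bound is evaluated in the form of Bates et al. (2021b), and the variance term `var` is a placeholder guess because the actual formula from Appendix A.1 is not reproduced here. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p); intended for small n only."""
    return sum(math.comb(n, j) * p**j * (1.0 - p) ** (n - j)
               for j in range(k + 1))


def risk_ucb(misses: int, n: int, delta: float) -> float:
    """Binomial upper confidence bound on the risk, found by bisection."""
    if misses >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = (lo + hi) / 2.0
        if binom_cdf(misses, n, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return hi


def fairness_ucb(hr1: float, hr2: float, n1: int, n2: int,
                 delta_hat: float) -> float:
    """Hit-rate gap plus a Bernstein-style margin in the shape of Eq. (14)."""
    gap = abs(hr1 - hr2)
    # ASSUMPTION: sum of the two Bernoulli variances stands in for sigma_F^2.
    var = hr1 * (1.0 - hr1) + hr2 * (1.0 - hr2)
    return gap + math.sqrt(2.0 * var * math.log(2.0 / delta_hat) / (n1 + n2))


def bounds(s1: List[float], s2: List[float], lam1: float, lam2: float,
           delta: float, delta_hat: float) -> Tuple[float, float, float]:
    """Risk bounds for both groups and the fairness bound at (lam1, lam2).

    s1 and s2 hold the model score of each user's true item; the item is
    covered exactly when its score is at least the group's threshold.
    """
    miss1 = sum(s < lam1 for s in s1)
    miss2 = sum(s < lam2 for s in s2)
    hr1, hr2 = 1.0 - miss1 / len(s1), 1.0 - miss2 / len(s2)
    return (risk_ucb(miss1, len(s1), delta),
            risk_ucb(miss2, len(s2), delta),
            fairness_ucb(hr1, hr2, len(s1), len(s2), delta_hat))


def calibrate(s1: List[float], s2: List[float], alpha: float, eta: float,
              delta: float, delta_hat: float,
              step1: float = 0.01, step2: float = 0.01
              ) -> Tuple[float, float, bool]:
    """Lower both thresholds greedily until all bounds meet their targets."""
    lam1, lam2 = 1.0, 1.0
    while True:
        r1, r2, f = bounds(s1, s2, lam1, lam2, delta, delta_hat)
        feasible = r1 <= alpha and r2 <= alpha and f <= eta
        if feasible or (lam1 == 0.0 and lam2 == 0.0):
            return lam1, lam2, feasible
        lam1 = max(0.0, lam1 - step1)
        lam2 = max(0.0, lam2 - step2)


if __name__ == "__main__":
    # Synthetic true-item scores; group 2 is scored lower on purpose.
    g1 = [0.5 + 0.5 * (i + 1) / 100 for i in range(100)]
    g2 = [0.3 + 0.5 * (i + 1) / 100 for i in range(100)]
    print(calibrate(g1, g2, alpha=0.2, eta=0.2, delta=0.1, delta_hat=0.1))
```

Starting from the largest threshold and stopping at the first feasible point returns the largest thresholds on the search path, which correspond to the smallest sets along that path. The boolean flag reports whether the targets were actually met, since with very few calibration users even the full item set may not satisfy the bounds.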
We design experiments to 1) validate whether the framework can provide the desired coverage guarantee in terms of risk, better performance in terms of average set size, and improved fairness in terms of Hit Rate Difference (Hit Rate Diff.) and DCG Difference (DCG Diff.) across various datasets with sensitive attributes; 2) analyze how the parameters (i.e., $\alpha$, $\delta$, $\eta$, and $\hat{\delta}$) influence the performance; and 3) analyze the time efficiency of ENSUR compared to other fairness baselines.

6.1. Datasets and Base Models

We conduct experiments on four datasets with specific sensitive user attributes: (1) the Amazon Office dataset (e-commerce) (McAuley et al., 2015), grouped by item interactions; (2) the Last.fm dataset (music streaming) (Cantador et al., 2011), grouped by region (developed and other countries); (3) the MovieLens dataset (movie ratings) (Harper & Konstan, 2015), grouped by gender; and (4) the Book-Crossing dataset (book ratings) (Ziegler et al., 2005), grouped by age. We implement the proposed framework on five base recommendation models: DeepFM (Guo et al., 2017), GMF (Koren et al., 2009), MLP (Zhang et al., 2019), NeuMF (He et al., 2017), and LightGCN (He et al., 2020). Additionally, we compare our framework ENSUR with four fairness baselines: 1) NFCF (Islam et al., 2021); 2) MFCF (Islam et al., 2021); 3) GMF-UFR (Li et al., 2021a); and 4) NCF-UFR (Li et al., 2021a). The implementation details and the details of all the datasets, base models, and fairness baselines can be found in Appendices C and D.

6.2. Experimental Results

6.2.1. RESULTS W.R.T. PERFORMANCE AND FAIRNESS

We compare the performance and fairness of the ENSUR framework with five base recommendation models and four fairness baselines. We set the predefined risk threshold $\alpha = 0.20$ and fairness threshold $\eta = 0.20$ via manual validation. The error rates $\delta = 0.1$ and $\hat{\delta} = 0.1$ are set following (Bates et al., 2021b).
The coverage guarantee is measured in terms of risk; performance is measured using average set size; and fairness is compared using the disparity in these metrics between user groups (difference in Hit Rate and difference in DCG). The results for the Amazon Office dataset (grouped by interactions) are provided in Table 1, whereas the results for the MovieLens dataset (grouped by gender), the Last.fm dataset (grouped by region), and the Book-Crossing dataset (grouped by age) are provided in Tables 3 to 5, respectively, in Appendix E.

Table 1. Performance and fairness comparisons with base models and fairness baselines on the Amazon Office dataset grouped by interactions, in terms of risk, average set size, and Hit Rate Diff/DCG Diff, respectively. Bold indicates the best result and underline the second best; cases exceeding a threshold are specially marked.

| Method | Group | Risk | Average Set Size | Hit Rate | DCG | Hit Rate Diff | DCG Diff |
|---|---|---|---|---|---|---|---|
| DeepFM | 1 | 0.121 | | 0.879 | 0.418 | 0.155 | 0.17 |
| | 2 | 0.277 | | 0.723 | 0.248 | | |
| DeepFM + ENSUR | 1 | 0.192 | | 0.808 | 0.401 | 0.081 | 0.103 |
| | 2 | 0.111 | | 0.889 | 0.298 | | |
| GMF | 1 | 0.149 | | 0.851 | 0.439 | 0.212 | 0.225 |
| | 2 | 0.361 | | 0.639 | 0.214 | | |
| GMF + ENSUR | 1 | 0.197 | | 0.803 | 0.428 | 0.08 | 0.168 |
| | 2 | 0.117 | | 0.883 | 0.26 | | |
| LightGCN | 1 | 0.077 | | 0.923 | 0.477 | 0.126 | 0.238 |
| | 2 | 0.203 | | 0.797 | 0.239 | | |
| LightGCN + ENSUR | 1 | 0.087 | | 0.913 | 0.474 | 0.087 | 0.198 |
| | 2 | 0 | | 1 | 0.276 | | |
| MLP | 1 | 0.162 | | 0.838 | 0.409 | 0.219 | 0.19 |
| | 2 | 0.38 | | 0.62 | 0.219 | | |
| MLP + ENSUR | 1 | 0.197 | | 0.803 | 0.397 | 0.013 | 0.14 |
| | 2 | 0.184 | | 0.816 | 0.257 | | |
| NeuMF | 1 | 0.155 | | 0.845 | 0.414 | 0.225 | 0.185 |
| | 2 | 0.379 | | 0.621 | 0.229 | | |
| NeuMF + ENSUR | 1 | 0.182 | | 0.818 | 0.406 | 0.017 | 0.143 |
| | 2 | 0.199 | | 0.801 | 0.263 | | |
| Other Fairness Baselines | | | | | | | |
| NFCF | 1 | 0.196 | 28 | 0.804 | 0.391 | 0.115 | 0.134 |
| | 2 | 0.261 | | 0.689 | 0.257 | | |
| MFCF | 1 | 0.175 | 30 | 0.825 | 0.402 | 0.128 | 0.154 |
| | 2 | 0.303 | | 0.697 | 0.248 | | |
| NeuMF-UFR | 1 | 0.193 | 28 | 0.807 | 0.396 | 0.153 | 0.127 |
| | 2 | 0.346 | | 0.654 | 0.269 | | |
| GMF-UFR | 1 | 0.205 | 30 | 0.795 | 0.395 | 0.133 | 0.157 |
| | 2 | 0.368 | | 0.662 | 0.238 | | |

The results presented in Table 1 lead us to the following key observations:

The ENSUR framework ensures that all base models generate prediction sets that satisfy both risk control and fairness guarantees across all datasets. The ENSUR-enhanced models always keep the risk below the defined thresholds. For the base models, the risk threshold criterion is frequently not met. For example, on the Amazon Office dataset, the marked cases in Table 1 show that the risk threshold is not met for at least one group, i.e., the disadvantaged group, across all the base models. For the fairness baselines, we observe that the criteria are not met for both groups in most cases across all datasets, which may be because of their emphasis on trading off performance for accuracy.

We also observe that the ENSUR-enhanced models obtain the best results in average set size on all the datasets, but the best model varies among datasets. For example, MLP + ENSUR achieves the best recommendations in terms of average set size on the Amazon Office dataset. Similar trends are observed for the MovieLens, Last.fm, and Book-Crossing datasets, as shown in Tables 3 to 5 in Appendix E.
All ENSUR-enhanced models meet the fairness threshold for both the Hit Rate Diff and the DCG Diff across all datasets. However, the best-performing models vary by dataset. For example, MLP + ENSUR achieves the best fairness on the Amazon Office dataset under the Hit Rate Diff, while DeepFM + ENSUR outperforms all the other models in terms of DCG Diff. Tables 3 to 5 in Appendix E show similar trends for the remaining datasets. In addition, the base models do not always satisfy the fairness metrics and exceed the fairness threshold in the marked cases. Meanwhile, the fairness baseline models do satisfy the fairness metrics after sacrificing accuracy, but they are still inferior to the ENSUR-enhanced models.

Overall, the ENSUR framework effectively ensures both recommendation performance and fairness while guaranteeing risk control, providing valuable insights for real-world applications. We further discuss the generalizability of grouping strategies and practical applicability in Appendix F and Appendix G, respectively.

6.2.2. PARAMETER ANALYSIS

We further analyze the influence of the predefined risk-related parameters $\alpha$ and $\delta$ and the fairness-related parameters $\eta$ and $\hat{\delta}$ on the prediction sets generated by the ENSUR framework.

Effect of Risk Control Parameters $\alpha$ and $\delta$ on Prediction Set Sizes: We first evaluate the impact of the error rate $\alpha$, varying from 0.10 to 0.50 (in increments of 0.05), on average prediction set sizes under fixed risk confidence thresholds $\delta = 0.05, 0.10, 0.15$, using the Amazon Office dataset grouped by interactions (Figure 2). As $\alpha$ increases, the average set size decreases across all models. This decreasing trend demonstrates the framework's ability to generate valid prediction sets that adapt to the error rate $\alpha$. Similar trends can be observed on the remaining datasets; see Figures 6 to 8 in Appendix E.2.
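The mechanism behind this trend can be illustrated with a small synthetic sweep. The sketch below is a simplification made for illustration only: it uses the empirical miss rate in place of the upper confidence bound, a single group, and randomly generated scores rather than any dataset from the paper, so the printed values carry no empirical meaning. All names are hypothetical.

```python
import random
from typing import List


def largest_feasible_lambda(true_scores: List[float], alpha: float,
                            grid: List[float]) -> float:
    """Largest grid threshold whose empirical miss rate is at most alpha.

    The grid contains 0.0, where every item is included and the miss
    rate is zero, so at least one threshold is always feasible.
    """
    n = len(true_scores)
    feasible = [lam for lam in grid
                if sum(s < lam for s in true_scores) / n <= alpha]
    return max(feasible)


def average_set_size(all_scores: List[List[float]], lam: float) -> float:
    """Mean number of items per user with score at least lam."""
    return sum(sum(s >= lam for s in row) for row in all_scores) / len(all_scores)


# Synthetic scores: 200 users, 20 items each, seeded for repeatability.
rng = random.Random(0)
ALL_SCORES = [[rng.random() for _ in range(20)] for _ in range(200)]
# The true item of each user is the best-scored of their first five items.
TRUE_SCORES = [max(row[:5]) for row in ALL_SCORES]
GRID = [k / 100 for k in range(101)]
ALPHAS = [0.10 + 0.05 * k for k in range(9)]  # 0.10, 0.15, ..., 0.50

LAMBDAS = [largest_feasible_lambda(TRUE_SCORES, a, GRID) for a in ALPHAS]
SIZES = [average_set_size(ALL_SCORES, lam) for lam in LAMBDAS]

if __name__ == "__main__":
    for a, lam, size in zip(ALPHAS, LAMBDAS, SIZES):
        print(f"alpha={a:.2f}  lambda={lam:.2f}  avg_set_size={size:.2f}")
```

A looser $\alpha$ can only enlarge the set of admissible thresholds, so the selected threshold never decreases and, by nestedness, the average set size never increases.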
We further evaluate the effect of varying the risk confidence $\delta$ from 0.10 to 0.50 (in increments of 0.05) on average prediction set sizes under fixed risk thresholds ($\alpha = 0.15, 0.20, 0.25$) using the Book-Crossing dataset grouped by age (see Figure 3). In general, all the models show a decreasing trend, which validates the effectiveness of the proposed framework. Interestingly, prediction set sizes do not fluctuate much for smaller values of $\delta$, while a decreasing trend emerges as $\delta$ increases. This is because relaxing the confidence of the risk constraint makes our predictions less conservative, thereby reducing the number of items included in the prediction set. Similar behavior can be observed on the other datasets; see Figures 9 to 11 in Appendix E.2.

Effect of Fairness Control Parameters $\eta$ and $\hat{\delta}$ on Prediction Set Sizes: We analyze how varying $\eta$, measured by the Hit Rate Diff. and DCG Diff., from 0.10 to 0.50 (in increments of 0.05) under fixed fairness confidence levels ($\hat{\delta} = 0.15, 0.20, 0.25$) affects the average prediction set sizes on the MovieLens dataset grouped by gender (Figure 4). With increasing $\eta$, the prediction set size decreases, validating the model's capacity to produce smaller prediction sets under a less strict $\eta$ condition. The prediction set sizes usually stabilize after an initial decrease as $\eta$ rises, suggesting that the framework's sensitivity to $\eta$ diminishes beyond a certain point. This offers guidance on selecting appropriate fairness thresholds while maintaining usability. Similar results can be observed on the other datasets; see Figures 12 to 14 in Appendix E.2.

Finally, we examine trends in average prediction set sizes when varying the fairness confidence $\hat{\delta}$ from 0.10 to 0.50 (in increments of 0.05) under fixed fairness thresholds ($\eta = 0.15, 0.20, 0.25$) on the Last.fm dataset grouped by region (Figure 5).
We notice that as value of ˆδ increases, for a given fairness threshold, model becomes less conservative, and hence prediction set size decreases. This phenomenon further validates effectiveness of our framework in balancing between producing tight average prediction set size and ensuring fairness. Similarly, results for other datasets can be found in Figures 15 to 17 in Appendix E.2. Overall, this parameter analysis guides real-world applications in balancing performance and fairness with confidence guarantees. 6.2.3. TIME EFFICIENCY COMPARISON We analyze the computational cost (training time) of the ENSUR framework in comparison with other fairness baselines. Specifically, for in-processing fairness baselines such as NFCF and MFCF, we consider the fine-tuning step to calculate the training time. For post-processinng fairness baselines such as Neu MF-UFR and GMF-UFR, we take the re-reranking step as the training time. For proposed ENSUR framework, we take the calibration step as the training time. We measure the time of ENSUR, averaged on top of all the base models. Our experiments are conducted via 10-fold cross validation to ensure statistical reliability. The results are presented in Table 2. From the results, we can observe that our proposed framework ENSUR is significantly more time-efficient than other fairness baselines, which indicates scalability of our method. This is because in-processing methods like NFCF and MFCF involves model refitting which substantially increases the computational cost. By contrast, ENSUR operates independently of the training phase, eliminating this overhead. Additionally, ENSUR is substantially faster than Neu MF-UFR and GMF-UFR, other post-processing methods, because these models involve solving a constrained and complex optimization problem, whereas ENSUR employs a simple yet effective greedy-based algorithm. Table 2. Training time (minutes) comparison of our framework ENSUR with four fairness baselines. 
| Dataset | MFCF | NFCF | GMF-UFR | NeuMF-UFR | ENSUR (Ours) |
|---|---|---|---|---|---|
| Amazon Office | 45 | 50 | 25 | 22 | 8 |
| MovieLens | 75 | 90 | 49 | 45 | 12 |
| Last.fm | 50 | 58 | 35 | 30 | 8 |
| Book-Crossing | 110 | 135 | 68 | 65 | 15 |

Figure 2. Analysis of base models after applying the ENSUR framework in terms of average set size with varying α = {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on Amazon Office dataset grouped by Interactions under different δ. Panels: (a) δ = 0.05, (b) δ = 0.10, (c) δ = 0.15.

Figure 3. Analysis of base models after applying the ENSUR framework in terms of average set size with varying δ = {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on Book-Crossing dataset grouped by Age under different α. Panels: (a) α = 0.15, (b) α = 0.20, (c) α = 0.25.

Figure 4. Analysis of base models after applying the ENSUR framework in terms of average set size with varying η = {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on MovieLens dataset grouped by Gender under different δ̂. Panels: (a) δ̂ = 0.15, (b) δ̂ = 0.20, (c) δ̂ = 0.25.

Figure 5. Analysis of base models after applying the ENSUR framework in terms of average set size with varying δ̂ = {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on Last.fm dataset grouped by Region under different η. Panels: (a) η = 0.15, (b) η = 0.20, (c) η = 0.25.

7. Conclusion

This paper investigates two principal issues that affect the credibility of RS: confidence and fairness. We integrate the two factors into a unified framework called Equitable and Statistically Unbiased Recommendation (ENSUR), which dynamically outputs prediction sets whose risk and fairness are guaranteed to stay below a threshold with pre-specified high confidence, such as 90%, while retaining the minimum average size. Our theoretical analysis and empirical studies are consistent in validating its effectiveness.
It is noteworthy that the efficiency of optimizing ENSUR also depends on the tightness of the derived upper bounds for risk and fairness; we leave the question of whether tighter upper bounds exist to future work. Moreover, the proposed framework can work on top of any recommendation model by treating it as a black box, which offers a robust foundation for advancing fairness and reliability in RS, paving the way for future research and development in this field.

Acknowledgments

This work is partially supported by the Australian Research Council (ARC) under Grants DP220103717 and LE220100078, and the National Natural Science Foundation of China under Grant No. 62072257.

Impact Statement

Our framework dynamically tailors prediction set sizes in recommender systems, ensuring fairness and performance guarantees while reducing cognitive overload and resource inefficiencies. By addressing disparities in user experiences and promoting equitable recommendations, it advances inclusivity and transparency in user-centric platforms, with applications across e-commerce, streaming, and beyond.

References

Abdollahpouri, H., Mansoury, M., Burke, R., and Mobasher, B. The unfairness of popularity bias in recommendation, 2019.
Adomavicius, G. and Tuzhilin, A. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
Aggarwal, C. C. Recommender Systems: The Textbook. Springer Publishing Company, Incorporated, 1st edition, 2016.
Bates, S., Angelopoulos, A., Lei, L., Malik, J., and Jordan, M. Distribution-free, risk-controlling prediction sets. J. ACM, 68(6), Sep 2021a.
Bates, S., Angelopoulos, A., Lei, L., Malik, J., and Jordan, M. I. Distribution-free, risk-controlling prediction sets, 2021b.
Cantador, I., Brusilovsky, P., and Kuflik, T. Second workshop on information heterogeneity and fusion in recommender systems (HetRec2011). In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, pp. 387–388. Association for Computing Machinery, 2011.
Chang, S., Zhang, Y., Tang, J., Yin, D., Chang, Y., Hasegawa-Johnson, M. A., and Huang, T. S. Streaming recommender systems. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, pp. 381–389. International World Wide Web Conferences Steering Committee, 2017.
Chen, D., Yan, Q., Chen, C., Zheng, Z., Liu, Y., Ma, Z., Yu, C., Xu, J., and Zheng, B. Hierarchically constrained adaptive ad exposure in feeds. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM '22, pp. 3003–3012. Association for Computing Machinery, 2022.
Coscrato, V. and Bridge, D. Estimating and evaluating the uncertainty of rating predictions and top-n recommendations in recommender systems. ACM Trans. Recomm. Syst., April 2023.
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness, 2011.
Fan, W., Liu, X., Jin, W., Zhao, X., Tang, J., and Li, Q. Graph trend filtering networks for recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, pp. 112–121, 2022.
Fu, Z., Xian, Y., Gao, R., Zhao, J., Huang, Q., Ge, Y., Xu, S., Geng, S., Shah, C., Zhang, Y., and de Melo, G. Fairness-aware explainable recommendation over knowledge graphs, 2020.
Ge, Y., Liu, S., Gao, R., Xian, Y., Li, Y., Zhao, X., Pei, C., Sun, F., Ge, J., Ou, W., and Zhang, Y. Towards long-term fairness in recommendation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM '21. ACM, March 2021.
Ge, Y., Zhao, X., Yu, L., Paul, S., Hu, D., Hsieh, C.-C., and Zhang, Y. Toward pareto efficient fairness-utility tradeoff in recommendation through reinforcement learning. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM '22. ACM, February 2022.
Gong, X., Yuan, D., and Bao, W. Understanding partial multi-label learning via mutual information. In Advances in Neural Information Processing Systems, volume 34, 2021.
Gong, X., Yuan, D., and Bao, W. Discriminative metric learning for partial label learning. IEEE Transactions on Neural Networks and Learning Systems, 34(8):4428–4439, 2023a.
Gong, X., Yuan, D., Bao, W., and Luo, F. A unifying probabilistic framework for partially labeled data learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7):8036–8048, 2023b.
Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. DeepFM: A factorization-machine based neural network for CTR prediction, 2017.
Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2024. URL https://www.gurobi.com.
Han, Z., Chen, C., Zheng, X., Liu, W., Wang, J., Cheng, W., and Li, Y. In-processing user constrained dominant sets for user-oriented fairness in recommender systems. In Proceedings of the 31st ACM International Conference on Multimedia, MM '23, pp. 6190–6201. ACM, October 2023.
Han, Z., Chen, C., Zheng, X., Zhang, L., and Li, Y. Hypergraph convolutional network for user-oriented fairness in recommender systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '24, pp. 903–913. Association for Computing Machinery, 2024.
Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 2015.
He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering, 2017.
He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., and Wang, M. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 639–648, 2020.
He, X., Liu, Q., and Jung, S. The impact of recommendation system on user satisfaction: A moderated mediation approach. Journal of Theoretical and Applied Electronic Commerce Research, 19:448–466, 02 2024.
Hu, Y., Koren, Y., and Volinsky, C. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272. IEEE, 2008.
Huang, X., Du, B., and Liu, W. Multichannel color image denoising via weighted schatten p-norm minimization. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI '20, 2021. ISBN 9780999241165.
Islam, R., Keya, K. N., Zeng, Z., Pan, S., and Foulds, J. Debiasing career recommendations with neural fair collaborative filtering. In Proceedings of the Web Conference 2021, WWW '21, pp. 3779–3790, New York, NY, USA, 2021. Association for Computing Machinery.
Ko, H., Lee, S., Park, Y., and Choi, A. A survey of recommendation systems: recommendation models, techniques, and application fields. Electronics, 11(1):141, 2022.
Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
KWEON, W., Kang, S., Jang, S., and Yu, H. Top-personalized-k recommendation. In The Web Conference 2024, 2024.
Li, Y., Chen, H., Fu, Z., Ge, Y., and Zhang, Y. User-oriented fairness in recommendation. In Proceedings of the Web Conference 2021, WWW '21. ACM, April 2021a.
Li, Y., Chen, H., Fu, Z., Ge, Y., and Zhang, Y. User-oriented fairness in recommendation. In Proceedings of the Web Conference 2021, WWW '21, pp. 624–632. Association for Computing Machinery, 2021b.
Li, Y., Chen, H., Xu, S., Ge, Y., Tan, J., Liu, S., and Zhang, Y. Fairness in recommendation: Foundations, methods and applications, 2023.
Liu, W., Shen, X., Du, B., Tsang, I. W., Zhang, W., and Lin, X. Hyperspectral imagery classification via stochastic HHSVMs. IEEE Transactions on Image Processing, 28(2):577–588, 2019. doi: 10.1109/TIP.2018.2869691.
Lu, J., Wu, D., Mao, M., Wang, W., and Zhang, G. Recommender system application developments: a survey. Decision Support Systems, 74:12–32, 2015.
Ma, H., Xie, R., Meng, L., Feng, F., Du, X., Sun, X., Kang, Z., and Meng, X. Negative sampling in recommendation: A survey and future directions, 2024.
Maurer, A. and Pontil, M. Empirical Bernstein bounds and sample variance penalization, 2009.
McAuley, J., Targett, C., Shi, Q., and van den Hengel, A. Image-based recommendations on styles and substitutes, 2015.
Naghiaei, M., Rahmani, H. A., Aliannejadi, M., and Sonboli, N. Towards confidence-aware calibrated recommendation. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022.
Pedreschi, D., Ruggieri, S., and Turini, F. Measuring discrimination in socially-sensitive decision records. In Proceedings of the 2009 SIAM International Conference on Data Mining, pp. 581–592. SIAM, 2009.
Rahmani, H. A., Naghiaei, M., Dehghan, M., and Aliannejadi, M. Experiments on generalizability of user-oriented fairness in recommender systems, 2022.
Ricci, F., Rokach, L., and Shapira, B. Recommender Systems Handbook, volume 1-35, pp. 1–35. Springer-Verlag Berlin, Heidelberg, 10 2010.
Santos, H. G. and Toffolo, T. Mixed integer linear programming with Python. Accessed: Apr, 2020.
Schafer, B., Konstan, J., and Riedl, J. Recommender systems in e-commerce. 1st ACM Conference on Electronic Commerce, Denver, Colorado, United States, 10 1999.
Sharma, A., Li, H., Li, X., and Jiao, J. Optimizing novelty of top-k recommendations using large language models and reinforcement learning. arXiv preprint arXiv:2406.14169, 2024.
Yao, S. and Huang, B. Beyond parity: Fairness objectives for collaborative filtering, 2017.
Zhang, S., Yao, L., Sun, A., and Tay, Y. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1–38, 2019.
Zhu, H., Xiong, F., Chen, H., Xiong, X., and Wang, L. Incorporating a triple graph neural network with multiple implicit feedback for social recommendation. ACM Transactions on the Web, 2024.
Ziegler, C.-N., McNee, S. M., Konstan, J. A., and Lausen, G. Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on World Wide Web, WWW '05. Association for Computing Machinery, 2005.
Zou, X. and Liu, W. Generalization bounds for adversarial contrastive learning, 2023. URL https://arxiv.org/abs/2302.10633.

A. Assumptions

Assumption A.1. In Theorem 4.2, we assume that the groups $G_1$ and $G_2$ are mutually independent and that hit rates or DCG scores are independently distributed within each group. Under these assumptions, the variances for the fairness metrics are calculated as follows:
$$\sigma^2_{\text{hit}} = \frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}, \qquad \sigma^2_{\text{DCG}} = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2},$$
where $\hat{p}_1$ and $\hat{p}_2$ are the observed hit rates, and $s_1^2$ and $s_2^2$ are the sample variances of DCG scores for groups $G_1$ and $G_2$, respectively. This assumption ensures that the application of Bernstein's inequality is valid, allowing us to derive the Upper Confidence Bound (UCB) for fairness metrics as shown in Equation (14).

Assumption A.2. Throughout Theorem 5.1, we make a mild assumption on $\lambda^{\min}$, i.e., the minimum value the parameter $\lambda$ can take, as follows:
$$\Pr\big(R_G(\lambda^{\min}_G) \le \alpha\big) \ge 1 - \delta, \qquad \Pr\big(F(\lambda^{\min}_{G_1}, \lambda^{\min}_{G_2}) \le \eta\big) \ge 1 - \delta,$$
where $\lambda^{\min}_G$ is the group-specific minimum value of the parameter for risk control, and $\lambda^{\min}_{G_1}$ and $\lambda^{\min}_{G_2}$ are the minimum values for fairness control across the groups. This assumption expresses the belief that we can control any user-defined risk $\alpha$ and fairness $\eta$ by taking valid $\lambda$ values in a closed set $\Lambda \subseteq \mathbb{R}^2 \cup \{\dots\}$.

B.1. Proof of Theorem 4.1

Proof. We focus on finding some $\hat{R}^+_G$ such that, out of $n$ samples, $\hat{R}_G$ yields at most $k = n\hat{R}_G$ successes (where a success is defined as observing a risk) with a significance level of at least $1 - \delta$.
The CDF of the binomial distribution is given by:
$$P(\mathrm{Binom}(n, p) \le k) = \sum_{i=0}^{k} \binom{n}{i} p^i (1 - p)^{n - i}.$$
Let us assume we know $\hat{R}^+_G$ and we seek $\hat{R}_G$ such that:
$$P\big(\mathrm{Binom}(n, \hat{R}^+_G) > n\hat{R}_G\big) \ge 1 - \delta.$$
Replacing $\hat{R}^+_G$ with the user-defined risk value $\alpha$, the equation becomes:
$$P\big(\mathrm{Binom}(n, \alpha) > n\hat{R}_G\big) \ge 1 - \delta$$
or
$$P\big(\mathrm{Binom}(n, \alpha) \le n\hat{R}_G\big) \le \delta,$$
which can be reformulated as:
$$\mathrm{BinomCDF}(n\hat{R}_G, n, \alpha) \le \delta.$$
To solve for $\hat{R}_G$, we find the root of this equation, which is also the UCB at $\alpha$, i.e.,
$$\mathrm{BinomCDF}(n\hat{R}_G, n, \alpha) - \delta = 0,$$
$$\hat{R}^+_G = \sup\big\{\hat{R}_G : \mathrm{BinomCDF}(n\hat{R}_G, n, \alpha) \le \delta\big\}.$$
Hence proved.

B.2. Proof of Theorem 4.2

Proof. Bernstein's inequality for a sum of independent random variables $X_i$ with mean $\mu$, variance $\sigma^2$, and bounded by $U$ states:
$$[\dots] \tag{17}$$
where $n$ is the number of observations, $X_i$ is the $i$-th random variable, $t$ is the deviation threshold, $\sigma^2$ is the variance of $X_i$, and $U$ is the upper bound on the range of $X_i$. Analogously, we consider, with some decision-maker confidence value $\hat{\delta}$, that the empirical fairness metric differs from the true fairness metric by the threshold $t$. This can be mathematically represented as: […] which rearranges to: […] Solving for $t$ gives:
$$t = \sqrt{2\sigma_F^2 \log\tfrac{2}{\hat{\delta}}\,[\dots]} \;[\dots]\; 3U \log\tfrac{2}{\hat{\delta}}\,[\dots]$$
Assuming $U = 1$ conservatively and $n = n_1 + n_2$, we obtain the UCB as:
$$F^+(\lambda_{G_1}, \lambda_{G_2}, \hat{\delta}) = F(\lambda_{G_1}, \lambda_{G_2}) \;[\dots]\; 2\sigma_F^2 \log\big(\tfrac{2}{[\dots]}\big)\,[\dots]$$
Hence proved.

B.3. Proof of Theorem 5.1

Proof. Let $\lambda^*_G$ be the highest parameter value for each group $G \in \{G_1, G_2\}$ such that the expected risk of not including truly relevant items and the fairness metric are at most $\alpha$ and $\eta$ respectively, i.e.,
$$\lambda^*_G = \max\big\{\lambda_G \in [\lambda_{\min,G}, \lambda_{\max,G}] : R_G(\lambda_G) \le \alpha \,\wedge\, F(\lambda_{G_1}, \lambda_{G_2}) \le \eta\big\}. \tag{18}$$
Assume that for a parameter value $\hat{\lambda}_G$, we have $R_G(\hat{\lambda}_G) > \alpha$ or $F(\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2}) > \eta$.
Then, by the definition of $\lambda^*_G$, we have
$$R_G(\lambda^*_G) \le \alpha, \qquad F(\lambda^*_{G_1}, \lambda^*_{G_2}) \le \eta,$$
which implies
$$R_G(\lambda^*_G) \le \alpha < R_G(\hat{\lambda}_G), \qquad F(\lambda^*_{G_1}, \lambda^*_{G_2}) \le \eta < F(\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2}).$$
Using Equation (3), we have: […] Since $\hat{\lambda}_G$ and $\lambda^*_G$ are real numbers, consider some $\xi > 0$ such that $(\lambda^*_G + \xi) \le \hat{\lambda}_G$. Utilizing the definitions of $\lambda^*_G$ and $\hat{\lambda}_G$ in Equation (18), we get
$$R^+_G(\lambda^*_G + \xi, \delta) \le \alpha < R_G(\lambda^*_G + \xi), \qquad F^+(\lambda^*_{G_1}, \lambda^*_{G_2} + \xi, \hat{\delta}) \le \eta < F(\lambda^*_{G_1}, \lambda^*_{G_2} + \xi). \tag{19}$$
According to the principles of the Upper Confidence Bound (UCB), i.e., Equation (11), the events $R^+_G(\lambda^*_G + \xi, \delta) \le \alpha$ or $F^+(\lambda^*_{G_1}, \lambda^*_{G_2} + \xi, \hat{\delta}) \le \eta$, occurring while the true risk or fairness exceeds its threshold, can only happen with probabilities not exceeding $\delta$ and $\hat{\delta}$ respectively. Specifically, the UCB ensures that the probability of observing $R_G(\hat{\lambda}_G) > \alpha$ is bounded by $\delta$, and the probability of $F(\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2}) > \eta$ is bounded by $\hat{\delta}$. Therefore, taking complements, under Assumption 1 and the defined ranges of $\delta$ and $\hat{\delta}$, we conclude that:
$$\Pr\big(R_G(\hat{\lambda}_G) \le \alpha\big) \ge 1 - \delta, \qquad \Pr\big(F(\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2}) \le \eta\big) \ge 1 - \hat{\delta}. \tag{20}$$
This validates the assertions of Theorem 5.1, which completes the proof.

B.4. Proof of Theorem 5.2

Proof. Since $R_G(\phi_{\lambda^*_G}) \le R_G(\phi_{\hat{\lambda}_G})$ and $F(\phi_{\lambda^*_{G_1}, \lambda^*_{G_2}}) \le F(\phi_{\hat{\lambda}_{G_1}, \hat{\lambda}_{G_2}})$, this relationship is expressed through the sum of relevance scores $m(u, i)$ over the items in the respective prediction sets for users:
$$\sum_{i \in \phi_{\lambda^*_G}(u)} m(u, i) \ge \sum_{i \in \phi_{\hat{\lambda}_G}(u)} m(u, i),$$
indicating that the accumulated scores of included items in $\phi_{\lambda^*_G}$ are greater. This is equivalent to:
$$\sum_{i \in \phi_{\lambda^*_G}(u) \setminus \phi_{\hat{\lambda}_G}(u)} m(u, i) \ge \sum_{i \in \phi_{\hat{\lambda}_G}(u) \setminus \phi_{\lambda^*_G}(u)} m(u, i).$$
For some items $i \in \phi_{\lambda^*_G}(u) \setminus \phi_{\hat{\lambda}_G}(u)$, $m(u, i) < \hat{\lambda}_G$, and for all items $i \in \phi_{\hat{\lambda}_G}(u) \setminus \phi_{\lambda^*_G}(u)$, $m(u, i) \ge \hat{\lambda}_G$, based on Algorithm 1. This condition is satisfied if $|\phi_{\lambda^*_G}(u)| \ge |\phi_{\hat{\lambda}_G}(u)|$. Thus, the expected size of the set using $\phi_{\hat{\lambda}_G}$ is optimized to be minimal, i.e.,
$$\mathbb{E}\big[|\phi_{\hat{\lambda}_G}(u)|\big] \le \mathbb{E}\big[|\phi_{\lambda^*_G}(u)|\big], \tag{21}$$
thereby proving the theorem.

C. Implementation Details

All base recommender models are trained for 20 epochs with a batch size of 256, a learning rate of 0.001, the Adam optimizer, and Binary Cross Entropy Loss (BCELoss). For the NFCF and MFCF models, we modified the original code to generalize the grouping logic to diverse criteria (e.g., interaction count, age, gender, and geography) and adapted the debiasing process to compute bias directions dynamically for various groups. To ensure consistency, we reused the score files generated by our base models for the GMF-UFR and NeuMF-UFR models. To enhance the reproducibility of the results, we used MIP (Santos & Toffolo, 2020), a free, lightweight Python library for modeling and optimization, instead of the commercially licensed Gurobi solver (Gurobi Optimization, LLC, 2024). For fair and sound comparisons with the base models and fairness baselines, instead of using arbitrary top-k predictions, we used the average optimal prediction set size returned by the ENSUR framework on top of the given base recommendation model.

D. Detailed Experimental Details

D.1. Datasets and Grouping Methods

In the main paper, we introduced four user grouping strategies to evaluate the fairness and performance of our framework: (1) grouping based on interaction count with items, (2) grouping based on user age, (3) grouping based on user gender, and (4) grouping based on geographic categorization into developed and other countries. These strategies were applied to the Amazon Office, Book-Crossing, MovieLens, and Last.fm datasets, respectively. Below, we provide further details on the grouping methodology:

Grouping by interaction count: Following Li et al. (2021a), users were initially split evenly into two groups, with 50% assigned to each group. The groups were then dynamically adjusted to ensure that the minimum interaction count in the advantaged group exceeded the maximum count in the disadvantaged group by at least one.
Grouping by age: Users were divided into two age groups: younger users (≤ 60 years) and older users (> 60 years).
Grouping by gender: Users were grouped into binary categories based on identified gender (male and female).
Geographic categorization: Users were categorized based on their country of origin into developed (e.g., USA, UK, Europe, Japan, etc.) and other countries.

Furthermore, we conducted an additional grouping experiment on the Last.fm dataset. We extended the interaction count-based grouping to incorporate interactions with popular items, following Abdollahpouri et al. (2019). The results of this experiment are provided in Appendix F.

D.2. Sampling and Data Splitting

We used the following sampling and splitting procedure:

Negative sampling: Following Ma et al. (2024), we selected 50 non-interacted items per user through negative sampling for training, validation, and testing.
Data splitting: We employed the Leave-One-Out (LOO) strategy (He et al., 2017; Han et al., 2023) to partition the dataset into training, calibration, and testing sets. Specifically, for each user, one interaction was isolated for calibration and testing, while the remaining interactions were used for training.
Multiple trials: To account for variability in sampling and splitting, we repeated the experiments over 20 independent trials. For each trial, random negative samples were drawn for training, validation, and testing. The results were averaged across all the trials.

D.3. Model Configurations and Fairness Baselines

To evaluate the effectiveness of our framework, we implemented it on top of the five base recommender models specified in the main paper. Here, we provide specific architectural and training details of the models used:

Base Recommendation Models

DeepFM: Combines 8 latent factors with deep layers of [50, 25, 10] and ReLU activation.
GMF: Utilizes an embedding size of 8 for capturing linear interactions between user and item embeddings.
MLP: Employs layers of [64, 32, 16] with ReLU activation for modeling non-linear interactions.
NeuMF: Integrates GMF and MLP with a GMF embedding size of 8 and MLP layers of [64, 32, 16], using ReLU activation.
LightGCN: Configured with an embedding size of 8 and 3 graph convolution layers.

To validate our framework further, we compared it with four fairness baseline approaches. The baselines are based on the most commonly adopted methods in the fairness literature, i.e., in-processing and post-processing methods (Li et al., 2023):

Fairness Baselines

NFCF and MFCF (in-processing) (Islam et al., 2021): The authors utilize a pre-training and fine-tuning approach to induce user-sided group fairness. Initially, the user embeddings are learned from non-sensitive interactions, followed by a de-biasing step to mitigate the embedding bias. Finally, the models are fine-tuned on sensitive item recommendations with a fairness penalty to reduce systemic bias in predictions.
NeuMF-UFR and GMF-UFR (post-processing) (Li et al., 2021a): This post-hoc re-ranking approach utilizes an integer programming solver to balance fairness and utility disparity between advantaged and disadvantaged user groups. The method optimizes preference scores while enforcing a fairness constraint, ensuring that recommendation quality differences (e.g., DCG@10, F1@10) between groups remain below a specified threshold.

E. Additional Experiments

E.1. Remaining Experiments (Continued)

Tables 3 to 5 extend the analysis provided in the main paper. These tables support the key findings: the ENSUR framework consistently meets both the risk control (α = 0.20) and fairness (η = 0.20) thresholds across all datasets, outperforming base models and fairness baselines.
These results reaffirm the main paper's observations regarding ENSUR's ability to balance fairness and performance while adapting effectively across diverse datasets.

Table 3. Performance and fairness comparisons with base models and fairness baselines on the MovieLens dataset grouped by gender in terms of risk, average set size, and Hit Rate Diff/DCG Diff, respectively. Bold indicates the best result, underline indicates the second best, and a marker flags threshold-exceeded cases.

| Method | Group | Risk | Average Set Size | Hit Rate | DCG | Hit Rate Diff | DCG Diff |
|---|---|---|---|---|---|---|---|
| DeepFM | 1 | 0.2 | | 0.8 | 0.503 | 0.017 | 0.022 |
| | 2 | 0.183 | | 0.817 | 0.525 | | |
| DeepFM + ENSUR | 1 | 0.188 | | 0.812 | 0.504 | 0.002 | 0.018 |
| | 2 | 0.187 | | 0.813 | 0.522 | | |
| GMF | 1 | 0.147 | | 0.853 | 0.538 | 0.051 | 0.019 |
| | 2 | 0.198 | | 0.802 | 0.519 | | |
| GMF + ENSUR | 1 | 0.155 | | 0.845 | 0.526 | 0.043 | 0.008 |
| | 2 | 0.198 | | 0.802 | 0.517 | | |
| LightGCN | 1 | 0.212 | | 0.788 | 0.432 | 0.077 | 0.043 |
| | 2 | 0.289 | | 0.711 | 0.389 | | |
| LightGCN + ENSUR | 1 | 0.128 | | 0.873 | 0.47 | 0.001 | 0.031 |
| | 2 | 0.128 | | 0.872 | 0.44 | | |
| MLP | 1 | 0.173 | | 0.827 | 0.553 | 0.016 | 0.007 |
| | 2 | 0.158 | | 0.842 | 0.56 | | |
| MLP + ENSUR | 1 | 0.151 | | 0.849 | 0.557 | 0.014 | 0.004 |
| | 2 | 0.165 | | 0.835 | 0.553 | | |
| NeuMF | 1 | 0.199 | | 0.802 | 0.542 | 0.002 | 0.015 |
| | 2 | 0.198 | | 0.802 | 0.557 | | |
| NeuMF + ENSUR | 1 | 0.149 | | 0.851 | 0.556 | 0.05 | 0.005 |
| | 2 | 0.199 | | 0.801 | 0.551 | | |
| Other Fairness Baselines | | | | | | | |
| NFCF | 1 | 0.198 | 8 | 0.802 | 0.539 | 0.01 | 0.01 |
| | 2 | 0.205 | | 0.795 | 0.549 | | |
| MFCF | 1 | 0.243 | 9 | 0.757 | 0.552 | 0.002 | 0.009 |
| | 2 | 0.242 | | 0.758 | 0.561 | | |
| NeuMF-UFR | 1 | 0.216 | 8 | 0.784 | 0.528 | 0.034 | 0.021 |
| | 2 | 0.182 | | 0.818 | 0.549 | | |
| GMF-UFR | 1 | 0.215 | 9 | 0.785 | 0.527 | 0.03 | 0.022 |
| | 2 | 0.185 | | 0.815 | 0.549 | | |

Table 4. Performance and fairness comparisons with base models and fairness baselines on the Last.fm dataset grouped by region in terms of risk, average set size, and Hit Rate Diff/DCG Diff, respectively. Bold indicates the best result, underline indicates the second best, and a marker flags threshold-exceeded cases.
| Method | Group | Risk | Average Set Size | Hit Rate | DCG | Hit Rate Diff | DCG Diff |
|---|---|---|---|---|---|---|---|
| DeepFM | 1 | 0.171 | | 0.829 | 0.363 | 0.108 | 0.111 |
| | 2 | 0.279 | | 0.721 | 0.252 | | |
| DeepFM + ENSUR | 1 | 0.181 | | 0.819 | 0.358 | 0.016 | 0.016 |
| | 2 | 0.197 | | 0.803 | 0.342 | | |
| GMF | 1 | 0.186 | | 0.814 | 0.268 | 0.107 | 0.071 |
| | 2 | 0.293 | | 0.707 | 0.197 | | |
| GMF + ENSUR | 1 | 0.156 | | 0.844 | 0.273 | 0.019 | 0.023 |
| | 2 | 0.175 | | 0.825 | 0.25 | | |
| LightGCN | 1 | 0.217 | | 0.783 | 0.382 | 0.026 | 0.013 |
| | 2 | 0.243 | | 0.757 | 0.369 | | |
| LightGCN + ENSUR | 1 | 0.164 | | 0.836 | 0.392 | 0.03 | 0.02 |
| | 2 | 0.194 | | 0.806 | 0.39 | | |
| MLP | 1 | 0.221 | | 0.779 | 0.328 | 0.019 | 0.013 |
| | 2 | 0.24 | | 0.76 | 0.315 | | |
| MLP + ENSUR | 1 | 0.197 | | 0.803 | 0.331 | 0.007 | 0.008 |
| | 2 | 0.19 | | 0.81 | 0.323 | | |
| NeuMF | 1 | 0.201 | | 0.799 | 0.323 | 0.068 | 0.021 |
| | 2 | 0.269 | | 0.731 | 0.302 | | |
| NeuMF + ENSUR | 1 | 0.187 | | 0.813 | 0.330 | 0.011 | 0.004 |
| | 2 | 0.198 | | 0.802 | 0.326 | | |
| Other Fairness Baselines | | | | | | | |
| NFCF | 1 | 0.248 | 30 | 0.752 | 0.344 | 0.024 | 0.049 |
| | 2 | 0.272 | | 0.728 | 0.295 | | |
| MFCF | 1 | 0.231 | 45 | 0.769 | 0.269 | 0.066 | 0.051 |
| | 2 | 0.297 | | 0.703 | 0.218 | | |
| NeuMF-UFR | 1 | 0.213 | 30 | 0.787 | 0.306 | 0.045 | 0.019 |
| | 2 | 0.258 | | 0.742 | 0.287 | | |
| GMF-UFR | 1 | 0.211 | 45 | 0.789 | 0.245 | 0.067 | 0.048 |
| | 2 | 0.278 | | 0.722 | 0.197 | | |

Table 5. Performance and fairness comparisons with base models and fairness baselines on the Book-Crossing dataset grouped by age in terms of risk, average set size, and Hit Rate Diff/DCG Diff, respectively. Bold indicates the best result, underline indicates the second best, and a marker flags threshold-exceeded cases.
| Method | Group | Risk | Average Set Size | Hit Rate | DCG | Hit Rate Diff | DCG Diff |
|---|---|---|---|---|---|---|---|
| DeepFM | 1 | 0.123 | | 0.873 | 0.291 | 0.302 | 0.115 |
| | 2 | 0.429 | | 0.571 | 0.176 | | |
| DeepFM + ENSUR | 1 | 0.188 | | 0.812 | 0.251 | 0.003 | 0.02 |
| | 2 | 0.191 | | 0.809 | 0.231 | | |
| GMF | 1 | 0.187 | | 0.813 | 0.277 | 0.129 | 0.115 |
| | 2 | 0.316 | | 0.684 | 0.162 | | |
| GMF + ENSUR | 1 | 0.185 | | 0.815 | 0.268 | 0.116 | 0.05 |
| | 2 | 0.199 | | 0.801 | 0.232 | | |
| LightGCN | 1 | 0.154 | | 0.846 | 0.189 | 0.217 | 0.031 |
| | 2 | 0.371 | | 0.629 | 0.158 | | |
| LightGCN + ENSUR | 1 | 0.18 | | 0.82 | 0.186 | 0.019 | 0.025 |
| | 2 | 0.199 | | 0.801 | 0.161 | | |
| MLP | 1 | 0.124 | | 0.876 | 0.225 | 0.192 | 0.093 |
| | 2 | 0.316 | | 0.684 | 0.132 | | |
| MLP + ENSUR | 1 | 0.167 | | 0.833 | 0.194 | 0.029 | 0.019 |
| | 2 | 0.196 | | 0.804 | 0.175 | | |
| NeuMF | 1 | 0.145 | | 0.855 | 0.253 | 0.214 | 0.107 |
| | 2 | 0.359 | | 0.641 | 0.146 | | |
| NeuMF + ENSUR | 1 | 0.187 | | 0.813 | 0.227 | 0.004 | 0.036 |
| | 2 | 0.191 | | 0.809 | 0.204 | | |
| Other Fairness Baselines | | | | | | | |
| NFCF | 1 | 0.216 | 39 | 0.784 | 0.264 | 0.095 | 0.08 |
| | 2 | 0.311 | | 0.689 | 0.184 | | |
| MFCF | 1 | 0.248 | 35 | 0.752 | 0.252 | 0.087 | 0.074 |
| | 2 | 0.335 | | 0.665 | 0.178 | | |
| NeuMF-UFR | 1 | 0.183 | 39 | 0.817 | 0.236 | 0.143 | 0.041 |
| | 2 | 0.326 | | 0.674 | 0.195 | | |
| GMF-UFR | 1 | 0.195 | 35 | 0.805 | 0.265 | 0.118 | 0.097 |
| | 2 | 0.313 | | 0.687 | 0.168 | | |

E.2. Parameter Analysis (Continued)

Effect of Risk Control Parameters α and δ on Prediction Set Sizes. Figures 6 to 8 illustrate the trends in the average prediction set size as α varies from 0.10 to 0.50 (in increments of 0.05), while keeping the risk confidence fixed at δ = 0.05, 0.10, 0.15, using the Last.fm, Book-Crossing, and MovieLens datasets respectively. Similarly, Figures 9 to 11 present the trends in the average prediction set size as δ varies from 0.10 to 0.50 (in increments of 0.05), while keeping the risk thresholds fixed at α = 0.15, 0.20, 0.25, using the Amazon Office, Last.fm, and MovieLens datasets respectively. The observed trends in Figures 6 to 8 (variation in α) and Figures 9 to 11 (variation in δ) are consistent with the observations reported in Figure 2 (Amazon Office dataset) and Figure 3 (Book-Crossing dataset) in the main paper. These results reinforce the consistency of our framework's behavior across different datasets and grouping methods.
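The role of δ in these trends can be illustrated with a small numerical sketch. It computes a Clopper-Pearson-style exact binomial upper bound on a binary risk, in the form used by risk-controlling prediction sets (the largest p with P(Binom(n, p) ≤ ⌈n·R̂⌉) ≥ δ), and shows that the bound tightens as δ grows. The helper names `binom_cdf` and `binomial_ucb`, the sample size n = 500, and the empirical risk 0.15 are illustrative assumptions, not values or code from the paper.

```python
import math

def binom_cdf(k, n, p):
    """P(Binom(n, p) <= k) for 0 < p < 1, summed in log space for stability."""
    total = 0.0
    for i in range(min(k, n) + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def binomial_ucb(emp_risk, n, delta, tol=1e-7):
    """Largest p with P(Binom(n, p) <= ceil(n * emp_risk)) >= delta, by bisection.

    Hypothetical helper: an upper confidence bound on the true risk given an
    empirical risk measured on n calibration users.
    """
    k = math.ceil(n * emp_risk - 1e-9)  # guard against float round-off in n * emp_risk
    if k >= n:
        return 1.0
    lo, hi = 0.0, 1.0  # the CDF at k decreases from 1 to 0 as p goes from 0 to 1
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binom_cdf(k, n, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo

if __name__ == "__main__":
    n, emp_risk = 500, 0.15  # made-up calibration size and empirical risk
    for delta in (0.10, 0.30, 0.50):
        print(f"delta={delta:.2f}  UCB={binomial_ucb(emp_risk, n, delta):.4f}")
```

A larger δ yields a bound closer to the empirical risk, so a less conservative threshold can pass the check against α, which is consistent with the smaller prediction sets reported as δ increases.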
Effect of Fairness Control Parameters η and δ̂ on Prediction Set Sizes. Figures 12 to 14 illustrate how the average prediction set size changes as η varies from 0.10 to 0.50 (in increments of 0.05), while holding the fairness confidence fixed at δ̂ = 0.15, 0.20, 0.25. These results are based on the Amazon Office, Book-Crossing, and Last.fm datasets respectively. In contrast, Figures 15 to 17 display the trends in prediction set size as δ̂ ranges from 0.10 to 0.50 (in increments of 0.05), with fixed thresholds of η = 0.15, 0.20, 0.25. These findings are based on the Amazon Office, Book-Crossing, and MovieLens datasets respectively. These results further confirm that variations in η and δ̂ exhibit consistent patterns, emphasizing our framework's ability to adapt prediction set sizes effectively based on fairness constraints.

F. Generalizability of Grouping Methods

We validate whether ENSUR is adaptable to practitioners' demands for customized user groups based on specific biases or fairness concerns relevant to their context. Specifically, we test our framework using different grouping techniques on a single dataset, Last.fm: grouping users by item interactions, and grouping by both item interactions and interactions with popular items. The results can be found in Table 6 and Table 7. The results demonstrate that the ENSUR framework can dynamically generate prediction sets for users grouped by any condition. This is particularly useful in real-world scenarios, where different applications may have different definitions of fairness. By allowing any grouping method, the framework can support dynamic fairness criteria that evolve with changing societal norms or organizational policies, thereby allowing practitioners to define user groups based on the specific biases or fairness concerns relevant to their context.

G. Practical Applicability of the Framework

We now analyze the practical applicability of our framework.

Figure 6. Analysis of base models after applying the ENSUR framework in terms of average set size with varying α = {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on Last.fm dataset grouped by Region under different δ. Panels: (a) δ = 0.05, (b) δ = 0.10, (c) δ = 0.15.

Figure 7. Analysis of base models after applying the ENSUR framework in terms of average set size with varying α = {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on Book-Crossing dataset grouped by Age under different δ. Panels: (a) δ = 0.05, (b) δ = 0.10, (c) δ = 0.15.

Figure 8. Analysis of base models after applying the ENSUR framework in terms of average set size with varying α = {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on MovieLens dataset grouped by Gender under different δ. Panels: (a) δ = 0.05, (b) δ = 0.10, (c) δ = 0.15.

In real-world recommendation systems, prediction sets are often fixed to a specific size k and applied uniformly across all users. This fixed size is typically determined heuristically or through trial and error, aiming to maximize the likelihood of including items that users may interact with while prioritizing and ranking items by relevance. However, this heuristic approach has several limitations:
Fixed-size sets can lead to cognitive overload for users when the size is too large, or fail to meet individual user needs when the size is too small.
They do not account for disparities in user engagement or group fairness, potentially disadvantaging certain user groups.
Recommending unnecessary items results in resource inefficiencies for platforms.
Our framework addresses these challenges by dynamically determining the minimum prediction set size for each user, satisfying fairness and performance guarantees with statistical confidence (e.g., 95%). This complements existing recommender systems, since an appropriate aggregation method (for example, the mean) can be applied to compute a global k.
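The aggregation step just described can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `global_k`, the choice of rounding up to an integer, and the sample sizes are hypothetical and do not come from the paper, and the sketch does not establish whether the statistical guarantees carry over unchanged to the rounded value.

```python
import math
from statistics import mean, median

def global_k(set_sizes, how="mean"):
    """Collapse per-user prediction set sizes into a single integer list length k.

    Hypothetical helper: `set_sizes` holds the per-user set sizes produced by a
    calibrated set predictor; `how` selects the aggregation method.
    """
    if not set_sizes:
        raise ValueError("set_sizes must be non-empty")
    if how == "mean":
        value = mean(set_sizes)
    elif how == "median":
        value = median(set_sizes)
    else:
        raise ValueError(f"unknown aggregation: {how}")
    # Rounding up is an illustrative choice so that k never falls below the aggregate.
    return max(1, math.ceil(value))

if __name__ == "__main__":
    sizes = [5, 7, 9, 6, 8]  # made-up per-user set sizes
    print(global_k(sizes), global_k(sizes, how="median"))
```

The median is included only as an example of an alternative aggregation; the text itself names the mean.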
This global k, obtained with the theoretical guarantees, can then be applied to recommend unseen items to users, ensuring that fairness and performance guarantees hold across the system. For example, in e-commerce platforms such as Amazon, instead of heuristically fixing k = 10 for all users, our framework identifies an optimal k (e.g., k = 7) that balances fairness and accuracy, reducing unnecessary recommendations and enhancing user satisfaction while optimizing platform resources. Similarly, in streaming services like Netflix, dynamically adjusting k in cold-start scenarios keeps recommendations concise and personalized, which avoids overwhelming users and respects platform resource constraints. Additionally, the calculated k can serve as a benchmark to fine-tune recommendation models, enabling iterative improvements in fairness and accuracy across diverse user groups. By tailoring prediction set sizes dynamically, our framework provides a practical, scalable solution for modern recommendation systems.

(a) α = 0.15 (b) α = 0.20 (c) α = 0.25
Figure 9. Analysis of base models after applying the ENSUR framework in terms of average set size with varying δ ∈ {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on the Amazon Office dataset grouped by Interactions under different α.

(a) α = 0.15 (b) α = 0.20 (c) α = 0.25
Figure 10. Analysis of base models after applying the ENSUR framework in terms of average set size with varying δ ∈ {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on the Last.fm dataset grouped by Region under different α.

(a) α = 0.15 (b) α = 0.20 (c) α = 0.25
Figure 11. Analysis of base models after applying the ENSUR framework in terms of average set size with varying δ ∈ {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on the MovieLens dataset grouped by Gender under different α.

(a) δ̂ = 0.15 (b) δ̂ = 0.20 (c) δ̂ = 0.25
Figure 12.
Analysis of base models after applying the ENSUR framework in terms of average set size with varying η ∈ {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on the Amazon Office dataset grouped by Interactions under different δ̂.

(a) δ̂ = 0.15 (b) δ̂ = 0.20 (c) δ̂ = 0.25
Figure 13. Analysis of base models after applying the ENSUR framework in terms of average set size with varying η ∈ {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on the Book-Crossing dataset grouped by Age under different δ̂.

(a) δ̂ = 0.15 (b) δ̂ = 0.20 (c) δ̂ = 0.25
Figure 14. Analysis of base models after applying the ENSUR framework in terms of average set size with varying η ∈ {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on the Last.fm dataset grouped by Region under different δ̂.

(a) η = 0.15 (b) η = 0.20 (c) η = 0.25
Figure 15. Analysis of base models after applying the ENSUR framework in terms of average set size with varying δ̂ ∈ {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on the Amazon Office dataset grouped by Interactions under different η.

(a) η = 0.15 (b) η = 0.20 (c) η = 0.25
Figure 16. Analysis of base models after applying the ENSUR framework in terms of average set size with varying δ̂ ∈ {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on the Book-Crossing dataset grouped by Age under different η.

(a) η = 0.15 (b) η = 0.20 (c) η = 0.25
Figure 17. Analysis of base models after applying the ENSUR framework in terms of average set size with varying δ̂ ∈ {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50} on the MovieLens dataset grouped by Gender under different η.

Table 6. Performance and fairness comparisons with base models and fairness baselines on the Last.fm dataset grouped by Item Interactions in terms of risk, average set size, and Hit Rate Diff/DCG Diff, respectively.
Bold indicates the best result, underline indicates the second best and marks threshold exceeded cases.

| Method | Group | Risk | Average Set Size | Hit Rate | DCG | Hit Rate Diff | DCG Diff |
|---|---|---|---|---|---|---|---|
| Grouped by number of interactions | | | | | | | |
| DeepFM | 1 | 0.183 | | 0.817 | 0.494 | 0.073 | 0.035 |
| | 2 | 0.255 | | 0.745 | 0.458 | | |
| DeepFM + ENSUR | 1 | 0.177 | | 0.823 | 0.494 | 0.013 | 0.022 |
| | 2 | 0.19 | | 0.81 | 0.472 | | |
| GMF | 1 | 0.163 | | 0.837 | 0.567 | 0.08 | 0.066 |
| | 2 | 0.243 | | 0.757 | 0.501 | | |
| GMF + ENSUR | 1 | 0.183 | | 0.817 | 0.56 | 0.001 | 0.045 |
| | 2 | 0.183 | | 0.817 | 0.515 | | |
| LightGCN | 1 | 0.179 | | 0.821 | 0.547 | 0.18 | 0.089 |
| | 2 | 0.359 | | 0.641 | 0.458 | | |
| LightGCN + ENSUR | 1 | 0.201 | | 0.799 | 0.492 | 0.003 | 0.066 |
| | 2 | 0.198 | | 0.802 | 0.426 | | |
| MLP | 1 | 0.192 | | 0.808 | 0.44 | 0.077 | 0.076 |
| | 2 | 0.269 | | 0.731 | 0.364 | | |
| MLP + ENSUR | 1 | 0.151 | | 0.849 | 0.448 | 0.041 | 0.067 |
| | 2 | 0.192 | | 0.808 | 0.38 | | |
| NeuMF | 1 | 0.151 | | 0.849 | 0.58 | 0.087 | 0.081 |
| | 2 | 0.238 | | 0.762 | 0.499 | | |
| NeuMF + ENSUR | 1 | 0.142 | | 0.858 | 0.581 | 0.027 | 0.067 |
| | 2 | 0.169 | | 0.831 | 0.513 | | |
| Other Fairness Baselines | | | | | | | |
| NFCF | 1 | 0.248 | 16 | 0.822 | 0.569 | 0.039 | 0.053 |
| | 2 | 0.272 | | 0.783 | 0.516 | | |
| MFCF | 1 | 0.231 | 13 | 0.815 | 0.529 | 0.042 | 0.021 |
| | 2 | 0.297 | | 0.773 | 0.508 | | |
| NeuMF-UFR | 1 | 0.213 | 16 | 0.827 | 0.546 | 0.045 | 0.031 |
| | 2 | 0.258 | | 0.782 | 0.515 | | |
| GMF-UFR | 1 | 0.211 | 13 | 0.819 | 0.536 | 0.047 | 0.029 |
| | 2 | 0.278 | | 0.772 | 0.517 | | |

Table 7. Performance and fairness comparisons with base models and fairness baselines on the Last.fm dataset grouped by Item Interactions & Interaction with Popular Items in terms of risk, average set size, and Hit Rate Diff/DCG Diff, respectively. Bold indicates the best result, underline indicates the second best and marks threshold exceeded cases.
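| Method | Group | Risk | Average Set Size | Hit Rate | DCG | Hit Rate Diff | DCG Diff |
|---|---|---|---|---|---|---|---|
| Grouped by number of total interactions & popular items interactions | | | | | | | |
| DeepFM | 1 | 0.078 | | 0.922 | 0.56 | 0.22 | 0.226 |
| | 2 | 0.298 | | 0.702 | 0.334 | | |
| DeepFM + ENSUR | 1 | 0.162 | | 0.838 | 0.54 | 0.162 | 0.151 |
| | 2 | 0 | | 1 | 0.389 | | |
| GMF | 1 | 0.063 | | 0.937 | 0.603 | 0.112 | 0.231 |
| | 2 | 0.176 | | 0.824 | 0.372 | | |
| GMF + ENSUR | 1 | 0.163 | | 0.837 | 0.578 | 0.163 | 0.174 |
| | 2 | 0 | | 1 | 0.405 | | |
| LightGCN | 1 | 0.076 | | 0.924 | 0.627 | 0.119 | 0.24 |
| | 2 | 0.195 | | 0.805 | 0.387 | | |
| LightGCN + ENSUR | 1 | 0.126 | | 0.874 | 0.585 | 0.03 | 0.106 |
| | 2 | 0.156 | | 0.844 | 0.479 | | |
| MLP | 1 | 0.057 | | 0.943 | 0.522 | 0.152 | 0.239 |
| | 2 | 0.209 | | 0.791 | 0.283 | | |
| MLP + ENSUR | 1 | 0.143 | | 0.857 | 0.5 | 0.143 | 0.179 |
| | 2 | 0 | | 1 | 0.322 | | |
| NeuMF | 1 | 0.07 | | 0.93 | 0.661 | 0.147 | 0.263 |
| | 2 | 0.217 | | 0.783 | 0.398 | | |
| NeuMF + ENSUR | 1 | 0.154 | | 0.846 | 0.64 | 0.154 | 0.198 |
| | 2 | 0 | | 1 | 0.442 | | |
| Other Fairness Baselines | | | | | | | |
| NFCF | 1 | 0.115 | 29 | 0.885 | 0.629 | 0.08 | 0.208 |
| | 2 | 0.195 | | 0.805 | 0.421 | | |
| MFCF | 1 | 0.137 | 30 | 0.863 | 0.549 | 0.109 | 0.151 |
| | 2 | 0.243 | | 0.757 | 0.44 | | |
| NeuMF-UFR | 1 | 0.111 | 29 | 0.889 | 0.588 | 0.077 | 0.166 |
| | 2 | 0.188 | | 0.812 | 0.428 | | |
| GMF-UFR | 1 | 0.121 | 30 | 0.879 | 0.566 | 0.057 | 0.198 |
| | 2 | 0.178 | | 0.822 | 0.368 | | |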
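As a reading aid for the Diff columns, the sketch below recomputes them from the per-group values. It assumes that Hit Rate Diff and DCG Diff are absolute between-group differences of the corresponding metric; this assumption is consistent with the reported rows up to rounding, but the formal definitions are those given in the main text. The helper name `group_gap` is hypothetical.

```python
def group_gap(metric_by_group):
    """Largest absolute difference of a metric between any two groups.

    With exactly two groups this is simply |metric(group 1) - metric(group 2)|.
    """
    values = list(metric_by_group.values())
    return max(values) - min(values)


# Per-group values copied from the DeepFM rows of Table 7.
hit_rate = {1: 0.922, 2: 0.702}
dcg = {1: 0.56, 2: 0.334}

print(round(group_gap(hit_rate), 3))  # 0.22
print(round(group_gap(dcg), 3))       # 0.226
```

Both printed values agree with the Hit Rate Diff (0.22) and DCG Diff (0.226) reported for DeepFM in Table 7.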