Journal of Machine Learning Research 25 (2024) 1-7. Submitted 5/24; Revised 9/24; Published 10/24.

Aequitas Flow: Streamlining Fair ML Experimentation

Sérgio Jesus (Feedzai, University of Porto), sergio.jesus@feedzai.com
Pedro Saleiro (Feedzai), pedro.saleiro@feedzai.com
Inês Oliveira e Silva (Feedzai), ines.silva@feedzai.com
Beatriz M. Jorge (Feedzai), beatriz.jorge@feedzai.com
Rita P. Ribeiro (University of Porto), rpribeiro@fc.up.pt
João Gama (University of Porto), jgama@fep.up.pt
Pedro Bizarro (Feedzai), pedro.bizarro@feedzai.com
Rayid Ghani (Carnegie Mellon University), rayid@cmu.edu

Editor: Sebastian Schelter

Abstract

Aequitas Flow is an open-source framework and toolkit for end-to-end fair machine learning (ML) experimentation and benchmarking in Python. This package fills integration gaps that exist in other fair ML packages. In addition to the existing audit capabilities in Aequitas, the Aequitas Flow module provides a pipeline for fairness-aware model training, hyperparameter optimization, and evaluation, enabling easy-to-use and rapid experiments and analysis of results. Aimed at ML practitioners and researchers, the framework offers implementations of methods, datasets, and metrics, together with standard interfaces for these components to improve extensibility. By facilitating the development of fair ML practices, Aequitas Flow aims to ease the incorporation of fairness concepts into AI systems, making them more robust and fair.

Keywords: fair machine learning, experimentation, ethical artificial intelligence, open-source framework, Python

1. Introduction

Developing Machine Learning (ML) and Artificial Intelligence (AI) systems that result in fairness and equity is a critical topic, especially as such systems are used in high-stakes settings such as hiring (Dastin, 2018), healthcare (Igoe, 2021), criminal justice (Angwin et al., 2016; Chouldechova, 2017), and financial services (Zhang and Zhou, 2019; Bartlett et al., 2019; Jesus et al., 2022).
While numerous studies define metrics and properties of algorithmic fairness (Chouldechova, 2017; Calders and Verwer, 2010; Dwork et al., 2012; Feldman et al., 2015; Hardt et al., 2016; Corbett-Davies et al., 2017) and propose methods for fairer models (Fish et al., 2016; Calmon et al., 2017; Zafar et al., 2017; Cotter et al., 2019), gaps in the implementation, user experience, and integration of existing tools hinder end-to-end experimentation (Lee and Singh, 2021) and benchmarking. This makes empirical studies and practical use challenging, scarce, and often limited in scope (Friedler et al., 2019; Lamba et al., 2021), ultimately affecting the adoption of fair ML methods in real-world high-stakes settings.

This paper introduces Aequitas Flow, an open-source framework for reproducible and extensible end-to-end fair ML experimentation that extends Aequitas, our original bias audit toolkit. The goal is to help 1) researchers compare and benchmark new methods they develop against existing methods in a systematic and reproducible manner, and 2) practitioners easily evaluate existing bias mitigation methods and deploy the ones that best match their goals.

© 2024 Sérgio Jesus, Pedro Saleiro, Inês Oliveira e Silva, Beatriz M. Jorge, Rita P. Ribeiro, João Gama, Pedro Bizarro, Rayid Ghani. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/24-0677.html.

Table 1: Comparison of packages for training and evaluation of fair ML methods.
| Functionalities                           | AIF360 | Fairlearn | Aequitas | Aequitas Flow |
|-------------------------------------------|--------|-----------|----------|---------------|
| Group fairness metrics                    | ◐      | ◐         | ●        | ●             |
| Pre-processing methods                    | ●      | ◐         | -        | ●             |
| In-processing methods                     | ●      | ◐         | -        | ●             |
| Post-processing methods                   | ●      | ●         | -        | ●             |
| Standardized interfaces for extensibility | ◐      | ◐         | -        | ●             |
| Hyperparameter optimization pipeline      | -      | -         | -        | ●             |
| Binary classification                     | ●      | ●         | ●        | ●             |
| Regression                                | ●      | ●         | -        | -             |
| Model selection                           | -      | -         | ◐        | ●             |
| Methods comparison                        | -      | -         | -        | ●             |
| Plotting methods                          | -      | ◐         | ●        | ●             |

● exists in package; ◐ partially exists in package; - does not exist in package.

Table 1 compares Aequitas Flow, the latest release of the Aequitas package (https://github.com/dssg/aequitas) (Saleiro et al., 2018), to other fair ML packages, highlighting some of the key gaps we aim to fill. Fairlearn (Weerts et al., 2023) and AIF360 (Bellamy et al., 2018) are popular fair ML packages that facilitate adoption by offering methods (Feldman et al., 2015; Hardt et al., 2016; Agarwal et al., 2018), fairness metrics (Hardt et al., 2016), and datasets available in the literature (Kohavi, 1996; Angwin et al., 2016; Dua and Graff, 2017; Ding et al., 2021). However, some issues hinder their usability as standard toolkits for fairness studies. First, both lack a defined experimentation pipeline, requiring users to rely on external packages for fundamental tasks such as dataset splitting and hyperparameter optimization (Schelter et al., 2019). Second, inconsistencies in class behavior and implementation force users to customize their code depending on the methods used. For instance, in AIF360's Disparate Impact Remover class, most of the parent class methods are not implemented. These issues create a high barrier to using the packages effectively. Our work tackles the lack of standardized tools for experimentation with fair ML, with an emphasis on the extensibility of methods, datasets, and metrics, the reproducibility of experiments, and different levels of customization for different user needs.

2. Aequitas Flow
The Aequitas Flow package is a comprehensive framework that integrates the necessary elements for a complete fair ML experiment: methods, datasets, and optimization strategies. These can be accessed through a standardized pipeline defined by configuration files or Python dictionaries, or instantiated independently. This provides a standardized platform for experimental fairness testing. Figure 1 represents the pipeline's structure, mapping the components and their interactions and offering an overview of the fairness experimentation process.

Figure 1: Diagram of an Experiment in Aequitas Flow. The user input is passed to the Experiment, which instantiates the components (Methods, Datasets, and Optimizer) in the pipeline. Depending on the target task (for a researcher or practitioner), different plotting methods can be used to analyze the experimental results.

Experiment: The Experiment is the main component that orchestrates the workflow within the package. It processes input configurations, which can be provided either as files (examples are provided in the repository) or as Python dictionaries. These specify the methods, datasets, and optimization parameters. The Experiment component initializes and populates the necessary classes, ensuring they interact deterministically throughout the execution process. When an experiment is completed, the results can be analyzed directly within the class with the appropriate methods. A variant of this component allows for simplified usage, as it only requires the definition of a dataset. This feature is designed to streamline initial experiments and reduce configuration effort.

```python
exp = Experiment(config_file="configs/experiment.yaml")
exp.run()
```

Optimizer: The Optimizer component manages hyperparameter selection and model evaluation. It receives the hyperparameter search space of the methods and a split dataset on which to conduct hyperparameter tuning. It evaluates the performance of models and stores the resulting artifacts. The component uses Optuna (Akiba et al., 2019) for hyperparameter selection and the bias auditing functionality of Aequitas (Saleiro et al., 2018) for fairness and performance evaluation. This component should only be instantiated by an Experiment, to guarantee consistency in input arguments. Several attributes of the hyperparameter optimization can be set through configurations, such as the number of trials and jobs, the selection algorithm (e.g., random search, grid search), and the random seed.

Datasets: This component has two primary functions: loading the data and generating splits. It maintains information about the prediction target, column typing, and sensitive features. The data is stored in a pandas DataFrame (pandas development team, 2020). The framework initially encompasses eleven tabular datasets, including those from Bank Account Fraud (Jesus et al., 2022) and Folktables (Ding et al., 2021). The component also accepts user-supplied datasets in CSV or Parquet format, with splits based on a column or performed randomly.

```python
dataset = datasets.FolkTables(variant="ACSIncome")
dataset.load_data()
dataset.create_splits()
dataset.train.X  # returns the train feature matrix
```

Methods: This group of components handles data processing and creates and adjusts predictions for the validation and test sets.
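As an illustration of the standardized interface these components follow, the sketch below shows a hypothetical base estimator exposing the fit(X, y, s) / predict_proba(X, s) convention used by the snippets in this section. The class and its scoring logic are illustrative only, not part of Aequitas Flow:

```python
import numpy as np
import pandas as pd

class PrevalenceScorer:
    """Toy base estimator following the fit(X, y, s) / predict_proba(X, s)
    convention: it scores every instance with the positive-label prevalence
    observed for its sensitive group during training."""

    def fit(self, X: pd.DataFrame, y: pd.Series, s: pd.Series) -> "PrevalenceScorer":
        # Store the mean label (positive rate) observed per sensitive group.
        self.group_rates_ = y.groupby(s).mean()
        self.global_rate_ = y.mean()
        return self

    def predict_proba(self, X: pd.DataFrame, s: pd.Series) -> np.ndarray:
        # Groups unseen at fit time fall back to the global positive rate.
        return s.map(self.group_rates_).fillna(self.global_rate_).to_numpy()

X = pd.DataFrame({"f": [1.0, 2.0, 3.0, 4.0]})
y = pd.Series([1, 0, 1, 1])
s = pd.Series(["a", "a", "b", "b"])
scores = PrevalenceScorer().fit(X, y, s).predict_proba(X, s)
# group "a" has positive rate 0.5, group "b" has positive rate 1.0
```

Because every method type exposes the same call shape, the Experiment can chain pre-, in-, and post-processing components without per-method glue code.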
Aequitas Flow provides interfaces for the three recognized types of fair ML methods (Caton and Haas, 2023; Mehrabi et al., 2021; Pessach and Shmueli, 2022): pre-processing, in-processing, and post-processing. Pre-processing methods modify the input data; in-processing methods typically modify the objective function directly and generate prediction scores; and post-processing methods adjust these scores or rankings. Additionally, ML classification methods are included in the category of base estimators and function similarly to in-processing methods. The methods adhere to a standardized interface to facilitate calls within the Experiment class. In the current version of Aequitas, 15 methods are supported.

```python
model = methods.inprocessing.FairGBM()
model.fit(train.X, train.y, train.s)
preds = model.predict_proba(val.X, val.s)
```

Audit: The Aequitas toolkit offers a suite of metrics based on the confusion matrix for the protected groups in the dataset. Users may specify a group as a reference for comparison and select the appropriate fairness metric for their analysis. Experiments leverage the Audit class to calculate metrics and disparities when analyzing the prediction scores produced by a model.

```python
audit_df = pd.DataFrame({"score": preds, "label": val.y, "group": val.s})
audit = Audit(audit_df)
audit.performance()  # obtain performance metrics
audit.audit()        # obtain fairness metrics
```

Plotting: Aequitas Flow provides two workflows based on the goal of the user. The first is centered on model selection (a), where users can plot the trained models with the desired fairness and performance metrics on each axis. The Pareto frontier is displayed, with the model with the best fairness-performance trade-off highlighted. The second provides a comparison of methods (b): confidence intervals for the combined performance and fairness are calculated for each tested method in the trade-offs of these metrics.
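The Pareto frontier underlying the model-selection plot (a) can be computed with a short helper like the one below. This is an illustrative sketch, not the package's implementation, and it treats higher fairness and higher performance as better:

```python
def pareto_front(points):
    """Return the non-dominated (fairness, performance) pairs.

    A point is dominated if another point is at least as good on both
    metrics and strictly better on at least one of them.
    """
    front = []
    for i, (f_i, p_i) in enumerate(points):
        dominated = any(
            (f_j >= f_i and p_j > p_i) or (f_j > f_i and p_j >= p_i)
            for j, (f_j, p_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((f_i, p_i))
    return front

# Each tuple is a trained model's (fairness, performance); the second
# model dominates the third, which is therefore dropped from the frontier.
models = [(0.9, 0.60), (0.7, 0.80), (0.5, 0.75), (0.4, 0.85)]
front = pareto_front(models)
```

Picking a single "best" model from this frontier then reduces to maximizing a scalarization of the two metrics, such as the alpha-weighted combination shown in Figure 2.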
Additional plotting methods are available for in-depth bias auditing. Figure 2 shows examples of both.

Figure 2: Plots introduced in Aequitas Flow. Plot (a) is designed for model selection; Plot (b) compares the different tested methods. (The example shows LightGBM, FairGBM, Oversampling, Undersampling, and Thresholding on the Bank Account Fraud (Base) dataset, with Age as the sensitive attribute, TPR as the performance metric, Predictive Equality as the fairness metric, and the combined objective alpha * TPR + (1 - alpha) * Pred. Eq.)

3. Conclusion

Aequitas Flow is an open-source framework that makes end-to-end experimentation with fair ML easier through customizable components, namely datasets, methods, metrics, and optimization algorithms. It enhances robustness and reproducibility by addressing the issues of ad-hoc and single-use setups in fair ML experimentation. This can lead to better benchmarking and adoption of fair ML techniques in real-world settings. While initially focused on tabular datasets, the framework's flexible interfaces allow adaptation to other data formats, and ongoing updates will incorporate additional implementations, in an environment welcoming community contributions. Recognizing the challenges associated with responsibly using this framework in real-world applications, we aim to support the widespread adoption of fair ML methodologies and increase their societal impact.

References

Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, and Hanna Wallach. A reductions approach to fair classification. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 60-69. PMLR, 2018. URL https://proceedings.mlr.press/v80/agarwal18a.html.

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework.
In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 2623-2631, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362016. doi: 10.1145/3292500.3330701. URL https://doi.org/10.1145/3292500.3330701.

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. In Ethics of Data and Analytics, pages 254-264. Auerbach Publications, 2016.

Robert Bartlett, Adair Morse, Richard Stanton, and Nancy Wallace. Consumer-lending discrimination in the FinTech era. Technical report, National Bureau of Economic Research, 2019.

Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias, October 2018. URL https://arxiv.org/abs/1810.01943.

Toon Calders and Sicco Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277-292, September 2010. ISSN 1573-756X. doi: 10.1007/s10618-010-0190-x. URL https://doi.org/10.1007/s10618-010-0190-x.

Flavio P. Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney. Optimized pre-processing for discrimination prevention. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, pages 3995-4004, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.

Simon Caton and Christian Haas. Fairness in machine learning: A survey. ACM Computing Surveys, August 2023. ISSN 0360-0300. doi: 10.1145/3616865. URL https://doi.org/10.1145/3616865.

Alexandra Chouldechova.
Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153-163, June 2017. ISSN 2167-6461. doi: 10.1089/big.2016.0047. URL http://www.liebertpub.com/doi/10.1089/big.2016.0047.

Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 797-806, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450348874. doi: 10.1145/3097983.3098095. URL https://doi.org/10.1145/3097983.3098095.

Andrew Cotter, Heinrich Jiang, and Karthik Sridharan. Two-player games for efficient non-convex constrained optimization. In Aurélien Garivier and Satyen Kale, editors, Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 300-332. PMLR, 2019. URL https://proceedings.mlr.press/v98/cotter19a.html.

Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias against women. In Ethics of Data and Analytics, pages 296-299. Auerbach Publications, 2018.

Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring Adult: New datasets for fair machine learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 6478-6490. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/32e54441e6382a7fbacbbbaf3c450059-Paper.pdf.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference
(ITCS '12), pages 214-226, New York, USA, 2012. ACM Press. ISBN 9781450311151. doi: 10.1145/2090236.2090255. URL http://arxiv.org/abs/1104.3913.

Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 259-268, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450336642. doi: 10.1145/2783258.2783311. URL https://doi.org/10.1145/2783258.2783311.

Benjamin Fish, Jeremy Kun, and Ádám Dániel Lelkes. A confidence-based approach for balancing fairness and accuracy. In Sanjay Chawla Venkatasubramanian and Wagner Meira Jr., editors, Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, Florida, USA, May 5-7, 2016, pages 144-152. SIAM, 2016. doi: 10.1137/1.9781611974348.17. URL https://doi.org/10.1137/1.9781611974348.17.

Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P. Hamilton, and Derek Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, pages 329-338, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361255. doi: 10.1145/3287560.3287589. URL https://doi.org/10.1145/3287560.3287589.

Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS '16, pages 3323-3331, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.

K. Igoe. Algorithmic bias in health care exacerbates social inequities: How to prevent it. Executive and Continuing Professional Education, 2021.

Sérgio Jesus, José Pombal, Duarte Alves, André Ferreira Cruz, Pedro Saleiro, Rita P. Ribeiro, João Gama, and Pedro Bizarro.
Turning the tables: Biased, imbalanced, dynamic tabular datasets for ML evaluation. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/d9696563856bd350e4e7ac5e5812f23c-Abstract-Datasets_and_Benchmarks.html.

Ron Kohavi. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD '96, pages 202-207. AAAI Press, 1996.

Hemank Lamba, Kit Rodolfa, and Rayid Ghani. An empirical comparison of bias reduction methods on real-world problems in high-stakes policy settings. ACM SIGKDD Explorations Newsletter, 23:69-85, May 2021. doi: 10.1145/3468507.3468518.

Michelle Seng Ah Lee and Jat Singh. The landscape and gaps in open source fairness toolkits. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI '21, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380966. doi: 10.1145/3411764.3445261. URL https://doi.org/10.1145/3411764.3445261.

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6), July 2021. ISSN 0360-0300. doi: 10.1145/3457607. URL https://doi.org/10.1145/3457607.

The pandas development team. pandas-dev/pandas: Pandas, February 2020. URL https://doi.org/10.5281/zenodo.3509134.

Dana Pessach and Erez Shmueli. A review on fairness in machine learning. ACM Computing Surveys, 55(3), February 2022. ISSN 0360-0300. doi: 10.1145/3494672. URL https://doi.org/10.1145/3494672.

Pedro Saleiro, Benedict Kuester, Abby Stevens, Ari Anisfeld, Loren Hinkson, Jesse London, and Rayid Ghani. Aequitas: A bias and fairness audit toolkit. arXiv preprint arXiv:1811.05577, 2018.

Sebastian Schelter, Yuxuan He, Jatin Khilnani, and Julia Stoyanovich. FairPrep: Promoting data to a first-class citizen in studies on fairness-enhancing interventions.
arXiv, abs/1911.12587, 2019. URL https://api.semanticscholar.org/CorpusID:208512964.

Hilde Weerts, Miroslav Dudík, Richard Edgar, Adrin Jalali, Roman Lutz, and Michael Madaio. Fairlearn: Assessing and improving fairness of AI systems, 2023. URL http://jmlr.org/papers/v24/23-0389.html.

Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 962-970. PMLR, 2017. URL https://proceedings.mlr.press/v54/zafar17a.html.

Yukun Zhang and Longsheng Zhou. Fairness assessment for artificial intelligence in financial industry. arXiv preprint arXiv:1912.07211, 2019.