Published in Transactions on Machine Learning Research (06/2024)

# Reproducibility study of Fair AC

Gijs de Jong (gijs.de.jong@student.uva.nl), University of Amsterdam
Macha J. Meijer (macha.meijer@student.uva.nl), University of Amsterdam
Derck W. E. Prinzhorn (derck.prinzhorn@student.uva.nl), University of Amsterdam
Harold Ruiter (harold.ruiter@student.uva.nl), University of Amsterdam

Reviewed on OpenReview: https://openreview.net/forum?id=ccDi5jtSF7

This work aims to reproduce the findings of the paper "Fair Attribute Completion on Graph with Missing Attributes" by Guo, Chu, and Li [1] by investigating the claims made in that paper. Our results suggest that the findings of the original paper are reproducible and that its claims hold. However, the claim that Fair AC is a generic framework for many downstream tasks is very broad and could therefore only be partially tested. Moreover, we show that Fair AC generalizes to various datasets and sensitive attributes, and we provide evidence that the improvement in group fairness achieved by the Fair AC framework does not come at the expense of individual fairness. Lastly, the codebase of Fair AC has been refactored and is now easily applicable to various datasets and models.

1 Introduction

In recent years, graphs have become an increasingly common data structure in real-world applications [2]. Graph neural networks (GNNs) have shown promising performance on a wide range of tasks, such as community detection [3], node classification [4], link prediction [5] and graph classification [6]. However, the graph data may be biased, for example due to under-representation of a specific group [7]. If a model is trained on such a biased graph, it may be unfair with respect to sensitive attributes such as demographics or gender [8, 9].
Aside from this, the node information of graph datasets is often incomplete, for example when a node was recently added to a network [10]. To address the problem of attribute completion and fairness in graph data, Guo, Chu, and Li [1] proposed Fair AC, a fair attribute completion framework that can be applied to a dataset, after which the result can be used as input for many downstream tasks, such as node classification or link prediction. The aim of this paper is to reproduce the original Fair AC paper and extend it. In summary, we make the following contributions:

1. The original Fair AC paper is reproduced and the original claims are analysed.
2. The codebase of Fair AC is improved and made more easily applicable to other datasets and models.
3. The original paper is analysed further by using new datasets, testing the framework on various sensitive attributes and evaluating its performance on individual fairness.

*Equal contribution.

2 Scope of reproducibility

This study describes the reproducibility of "Fair Attribute Completion on Graph with Missing Attributes" by Guo, Chu, and Li [1]. In earlier research, various fair graph methods have been developed, such as Fairwalk [11] and fair GraphSAGE [12]. While previous research addressed fairness in graphs with partially incomplete attributes [7], Fair AC uniquely targets nodes with entirely missing attributes. This approach also diverges from methods like those in [7] by addressing both node feature unfairness and topological unfairness. In this context, feature unfairness refers to bias within a single node's embedding, which can be viewed as unfair when it contains sensitive information. Topological unfairness arises when aggregating neighbouring nodes introduces sensitive information into the final embedding [13]. The paper proposes Fair AC, a fair attribute completion model for graphs.
For more details on this method, see Section 3. The claims made in the paper are as follows:

Claim 1. A new framework, Fair AC, is proposed, which can be used for fair graph attribute completion and addresses both feature and topological unfairness in the resulting graph embeddings.
Claim 2. Fair AC is generic and can be used in many graph-based downstream tasks.
Claim 3. Fair AC is effective in eliminating unfairness while maintaining an accuracy comparable to other methods.
Claim 4. Fair AC is effective even if a large fraction of the attributes is missing.
Claim 5. Adversarial learning is necessary to obtain a better performance on group fairness.

Beyond reproducing the original study, we extended our work to assess Fair AC's generalizability. This involved evaluating the framework across different datasets and varying sensitive attributes. Additionally, we conducted a more in-depth analysis of Fair AC's fairness capabilities, using a metric for individual fairness.

3 Methodology

The Fair AC implementation is openly accessible, but the baseline code for GCN and Fair GNN is not included in its repository. To address this, we integrated these separate, publicly available codebases into a unified framework. During the restructuring, we enhanced the codebase to support various datasets and different sensitive attributes in combination with Fair AC. Furthermore, we expanded Fair AC's evaluation criteria by incorporating a measure of individual fairness, offering a contrast to the original paper's focus on group fairness.

3.1 Model description

Fair AC is a novel framework designed to ensure fairness in attribute completion for graph nodes, as depicted in Figure 1. The framework consists of roughly three components.

1. Auto-encoder. The auto-encoder is used to generate node embeddings of complete nodes.
This involves adversarial training with a sensitive classifier tasked with detecting the presence of sensitive information in embeddings, guiding the auto-encoder towards fairer feature representations.

2. Attention-based attribute completion. Fair AC employs an attention mechanism to determine the importance of neighbouring nodes when completing the target node's attributes. The final embedding of the target node is created through a weighted aggregation, with the weights assigned by the attention mechanism.

3. Topological fairness. The sensitive classifier reappears in this component to evaluate the topological fairness of the target node's embedding, which is based entirely on neighbouring nodes. If sensitive information is present, it must have originated from the aggregation of neighbouring nodes, highlighting potential topological bias.

Figure 1: The Fair AC framework consists of mainly three components: an auto-encoder to generate embeddings, an attention-based mechanism for attribute completion and a sensitive classifier to apply adversarial learning [1].

Fair AC's loss function integrates these three components (detailed in Appendix A). A hyperparameter β is used to adjust the influence of the sensitive classifier on the overall loss; β is multiplied with the loss of the sensitive classifier in both the topological and the node embedding prediction parts. The loss function is defined as L = L_F + L_C + β·L_T, where L_F represents the feature fairness component, defined as L_F = L_ae − β·L_Cs.

3.2 Datasets

The authors have made CSV files of the three real-world datasets used for their experiments available. We follow the paper and reproduce their evaluations on multiple datasets for each method. We present the relevant datasets in detail in Table 5.

NBA dataset.
This is an extended version of a Kaggle dataset¹ containing performance statistics of approximately 400 NBA basketball players from the 2016–2017 season. It includes personal information such as age, nationality, gender and salary [7], with player interactions on Twitter defining the network relationships. The node label indicates whether a player's salary is above the median, and nationality serves as the sensitive attribute, intended to be excluded from the embeddings.

Pokec datasets. Derived from Pokec, a Slovak social network [14], the Pokec-z and Pokec-n datasets are adapted from Dai and Wang [7]. While the original Fair AC paper used region as the sensitive attribute for both Pokec datasets, our study also evaluates age and gender. Age is converted to a binary variable indicating whether a person is younger than 21 years, similar to the approach in the credit dataset [15].

Additional datasets, German Credit and Recidivism, were also used to evaluate Fair AC [15]. These datasets were selected for their different themes and small size, which makes attribute completion particularly useful because each data point carries more weight. The code to preprocess these datasets was adapted from PyGDebias².

For Fair AC training, the first 25% of each dataset was used, a limitation due to memory constraints of the machines used for the original paper. The subsequent GNN training utilized 50% of the data, with the remaining 25% reserved for evaluation. All data splits follow those used in the original paper.

¹ https://www.kaggle.com/datasets/noahgift/social-power-nba
² https://github.com/yushundong/PyGDebias/

3.3 Hyperparameters

In order to reproduce the experiments of the original authors as closely as possible, the hyperparameters used in the original paper were used when available. This means that the feature drop rate (α) is 0.3 unless mentioned otherwise.
β is 1.0 for all datasets, except for pokec-n, where β is 0.5. In case of missing hyperparameters, the parameters in the provided shell scripts were adopted. For training, 3000 epochs were used, including 200 epochs of auto-encoder pretraining. An Adam optimizer was used with a learning rate of 0.001 and a weight decay of 1×10⁻⁵. The details of all hyperparameters can be found in Appendix G. In the original paper, all experiments are run three times; we assumed that this was done with three different seeds. Since only one seed is stated in the available scripts, we decided to use the seeds 40, 41 and 42, per advice of the authors.

3.4 Evaluation metrics

Model performance is evaluated in terms of accuracy, AUC and fairness. When evaluating fairness, an important distinction can be made between two fairness types: group fairness and individual fairness. Group fairness measures equality between different protected groups, while individual fairness measures the similarity in classification outcome for similar individuals [16]. The original paper uses two metrics to evaluate (group) fairness: statistical parity and equal opportunity. Additionally, we add an individual fairness metric, consistency. The label y denotes the ground-truth node label and the sensitive attribute s indicates a sensitive group. We consider a binary node classification task and two sensitive groups s ∈ {0, 1}.

Statistical Parity. The Statistical Parity Difference (ΔSP) [9] captures the difference in model prediction outcomes between the sensitive groups:

ΔSP = P(ŷ = 1 | s = 0) − P(ŷ = 1 | s = 1)    (1)

Equal Opportunity. The Equal Opportunity Difference (ΔEO) [17] captures the difference in true positive rates between the sensitive groups:

ΔEO = P(ŷ = 1 | s = 0, y = 1) − P(ŷ = 1 | s = 1, y = 1)    (2)

Consistency.
Consistency is used to capture individual fairness by measuring the consistency of outcomes between individuals who are similar to each other [18]. The formal definition, with respect to a fairness similarity matrix S^F, is:

consistency = 1 − (Σ_{j≠i} |ŷ_i − ŷ_j| · S^F_ij) / (Σ_{j≠i} S^F_ij)    (3)

3.5 Experimental setup and code

All specific experiments conducted are described below, together with a description of the training details and the code.

3.5.1 Model training

The original paper provides a brief overview of the training algorithm; for more details refer to [1]. Our analysis found that certain critical details were missing from the original description. Regarding the data split, it was found that the AC model is trained on only 25% of the nodes in the dataset, while during testing on the downstream task the full 100% of the graph is used to complete any missing attributes and create new embeddings for the downstream task.

After the data is split, the training process starts. An interesting detail in this process is that the auto-encoder in the AC model is pre-trained for 200 iterations. This iteration count remains fixed across all datasets, meaning that for smaller datasets, such as NBA or German Credit (described in Section 3.2), the auto-encoder is pre-trained on comparatively less data. Furthermore, the auto-encoder and attribute completion model are optimized differently during pre-training than during the training iterations: during pre-training they are optimized using L_ae + L_C. This is to be expected, as the sensitive classifier is not being trained yet.

Evaluation of the model is done for the first time after 1000 epochs. Every 200 epochs, a GCN is trained on the embeddings generated by Fair AC. This GCN is evaluated and provides the reported results for the performance of Fair AC.
The final results reported are those of the model with the best fairness performance that scores above the accuracy and AUC thresholds set at the start of training. For all datasets, these thresholds were set to 0.65 and 0.69, respectively. The complete training procedure for the Fair AC model is given in Appendix C.

3.5.2 Code

The original paper published its code, including shell scripts to run specific experiments, based on the Fair GNN codebase³. However, the code was difficult to work with and contained several peculiarities that are not mentioned in the original work (as discussed in Section 3.5.1). In order to run the code on additional datasets and downstream tasks, the Fair AC framework has been rewritten into an ergonomic library that can be used with any GNN without significant changes to the code. The source code for this reproduction has been made available (see Appendix D for more details). In addition to creating the library, the implementation has been optimized, making the runtime about 33% faster.

3.5.3 Experiment reproducibility study

To accurately reproduce the original study, various experiments were conducted. Firstly, to verify claims 1, 2 and 3, the Fair AC framework was applied to the default GCN on the NBA, pokec-z and pokec-n datasets. The results were compared with those of using only a GCN and of the Fair GNN method. Conducting experiments on different datasets provides evidence for claim 2 specifically. To verify claim 4, experiments on all three models and on the Fair AC model without adversarial training were conducted with different attribute missing rates, and the results were compared. Similarly, for claim 5, the Fair AC model was trained with different β values to assess the effect of adversarial learning.

3.5.4 Experiments beyond original paper

To further analyze the Fair AC framework, additional experiments were conducted.
To provide additional evidence for claim 2, the framework was tested on two extra datasets, as described in Section 3.2. Additionally, the framework was tested on different sensitive attributes to provide further evidence of generalizability. Specifically, two additional experiments were run on the pokec-z dataset, with age and gender as the sensitive attributes. Lastly, to test whether there is a trade-off between the group fairness provided by Fair AC and individual fairness, all experiments were evaluated on an additional metric, consistency, which measures individual fairness. The consistency of the default GCN and of Fair AC were compared to assess the influence of Fair AC on individual fairness.

3.6 Computational requirements

All experiments were performed on a single NVIDIA Titan RTX GPU. Training one Fair AC model takes about 30 minutes on this GPU, while training a (Fair)GNN model takes about 10 minutes. Therefore, reproducing all original experiments costs about 31 GPU hours. Roughly 15 GPU hours were used for the additional experiments. Thus, 46 GPU hours were used in total. More details about the GPU usage can be found in Appendix F.

³ https://github.com/EnyanDai/FairGNN

4 Results

In this section, the results of the reproducibility study are described. This includes all experiments run to verify the claims described in Section 2. In addition, the results of the work beyond the original study are described.

4.1 Results reproducibility study

In Table 1, the reproduced results of the original study are shown, together with the performance on individual fairness, which is analyzed in Section 4.2. In the original study, the main trends observed are that Fair AC performs similarly to other methods on accuracy and AUC and outperforms all other alternatives on statistical parity and equal opportunity.
The same trends can be observed in the results of the reproducibility study. The accuracy and AUC scores are similar, although not as good as those of Fair GNN. On statistical parity and equal opportunity, Fair AC outperforms all other methods on almost every dataset. The actual numbers of all methods differ slightly from the original paper, which could be due to the optimization method not being deterministic. These results verify claims 1 and 3, as listed in Section 2: Fair AC addresses unfairness better than other methods while maintaining a comparable accuracy.

| Dataset | Method | Acc | AUC | ΔSP | ΔEO | ΔSP+ΔEO | Consistency |
|---|---|---|---|---|---|---|---|
| nba | GCN | 66.98 ± 1.18 | 76.15 ± 1.40 | 0.14 ± 0.13 | 0.57 ± 0.06 | 0.71 ± 0.18 | 2.64 ± 0.00 |
| nba | ALFR | 64.3 ± 1.3 | 71.5 ± 0.3 | 2.3 ± 0.9 | 3.2 ± 1.5 | 5.5 ± 2.4 | – |
| nba | ALFR-e | 66.0 ± 0.4 | 72.9 ± 1.0 | 4.7 ± 1.8 | 4.7 ± 1.7 | 9.4 ± 3.4 | – |
| nba | Debias | 63.1 ± 1.1 | 71.3 ± 0.7 | 2.5 ± 1.5 | 3.1 ± 1.9 | 5.6 ± 3.4 | – |
| nba | Debias-e | 65.6 ± 2.4 | 72.9 ± 1.2 | 5.3 ± 0.9 | 3.1 ± 1.3 | 8.4 ± 2.2 | – |
| nba | FCGE | 66.0 ± 1.5 | 73.6 ± 1.5 | 2.9 ± 1.0 | 3.0 ± 1.2 | 5.9 ± 2.2 | – |
| nba | Fair GNN | 68.39 ± 3.12 | 74.29 ± 1.19 | 2.81 ± 4.01 | 3.00 ± 4.07 | 5.81 ± 8.08 | 2.64 ± 0.00 |
| nba | Fair AC (Ours) | 66.51 ± 1.09 | 75.69 ± 1.31 | 0.09 ± 0.08 | 0.10 ± 0.00 | 0.19 ± 0.08 | 2.64 ± 0.00 |
| pokec-z | GCN | 65.10 ± 0.24 | 68.42 ± 0.12 | 1.72 ± 1.17 | 1.37 ± 0.51 | 3.08 ± 1.68 | 41.35 ± 0.01 |
| pokec-z | ALFR | 65.4 ± 0.4 | 71.3 ± 0.3 | 2.8 ± 0.5 | 1.1 ± 0.4 | 3.9 ± 0.9 | – |
| pokec-z | ALFR-e | 68.0 ± 0.6 | 74.0 ± 0.7 | 5.8 ± 0.4 | 2.8 ± 0.8 | 8.6 ± 1.2 | – |
| pokec-z | Debias | 65.2 ± 0.7 | 71.4 ± 0.6 | 1.9 ± 0.6 | 1.9 ± 0.4 | 3.8 ± 1.0 | – |
| pokec-z | Debias-e | 67.5 ± 0.7 | 74.2 ± 0.7 | 4.7 ± 1.0 | 3.0 ± 1.4 | 7.7 ± 2.4 | – |
| pokec-z | FCGE | 65.9 ± 0.2 | 71.0 ± 0.2 | 3.1 ± 0.5 | 1.7 ± 0.6 | 4.8 ± 1.1 | – |
| pokec-z | Fair GNN | 68.16 ± 0.59 | 75.67 ± 0.52 | 1.56 ± 0.45 | 3.17 ± 1.07 | 4.73 ± 1.47 | 41.35 ± 0.01 |
| pokec-z | Fair AC (Ours) | 65.33 ± 0.30 | 71.20 ± 1.74 | 0.55 ± 0.10 | 0.13 ± 0.15 | 0.68 ± 0.09 | 41.33 ± 0.00 |
| pokec-n | GCN | 67.88 ± 1.46 | 72.86 ± 1.44 | 3.22 ± 1.29 | 5.93 ± 2.76 | 9.15 ± 4.05 | 45.93 ± 0.00 |
| pokec-n | ALFR | 63.1 ± 0.6 | 67.7 ± 0.5 | 3.05 ± 0.5 | 3.9 ± 0.6 | 3.95 ± 1.1 | – |
| pokec-n | ALFR-e | 66.2 ± 0.4 | 71.9 ± 1.0 | 4.1 ± 1.8 | 4.6 ± 1.7 | 8.7 ± 3.5 | – |
| pokec-n | Debias | 62.6 ± 1.1 | 67.9 ± 0.7 | 2.4 ± 1.5 | 2.6 ± 1.9 | 5.0 ± 3.4 | – |
| pokec-n | Debias-e | 65.6 ± 2.4 | 71.7 ± 1.2 | 3.6 ± 0.9 | 4.4 ± 1.3 | 8.0 ± 2.2 | – |
| pokec-n | FCGE | 64.8 ± 1.5 | 69.5 ± 1.5 | 4.1 ± 1.0 | 5.5 ± 1.2 | 9.6 ± 2.2 | – |
| pokec-n | Fair GNN | 67.06 ± 0.37 | 71.58 ± 2.58 | 0.55 ± 0.50 | 0.30 ± 0.20 | 0.85 ± 0.31 | 45.93 ± 0.00 |
| pokec-n | Fair AC (Ours) | 67.00 ± 1.93 | 72.57 ± 1.68 | 0.11 ± 0.06 | 0.47 ± 0.81 | 0.58 ± 0.76 | 45.94 ± 0.02 |

Table 1: Comparison of Fair AC with Fair GNN on the nba, pokec-z and pokec-n datasets. The methods are applied on the GCN classifier. The ALFR, ALFR-e, Debias, Debias-e and FCGE baselines are adapted from Guo, Chu, and Li [1].

In addition to the main comparison of different baselines, an ablation study on different attribute missing rates (α) was done in the original paper to verify claim 4. The results are shown in Table 2. The trends in the reproduced results are very similar to the original ones, as Fair AC performs best on fairness for different α. However, for α = 0.1, we observe that Base AC, which has the same architecture as Fair AC but without adversarial training, performs slightly better than Fair AC. In addition, we observe that adversarial training is more useful at large attribute missing rates, an observation that was not made in the original paper.

Although the original results were reproduced, the reproduction would not have been possible without access to the codebase. Many training details, such as hyperparameters and the pretraining of the auto-encoder, were only mentioned in the codebase, not in the paper. In addition, the random seeds used in the original study were only retrieved after contact with the authors.
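As an illustration, the group-fairness metrics reported in these tables (Section 3.4) can be computed directly from binary predictions. The following is a minimal NumPy sketch written for this report, not the repository's implementation; the function names are our own, and we take the absolute difference since the tables report non-negative values.

```python
import numpy as np

def statistical_parity(y_pred, s):
    """Delta_SP: difference in positive prediction rates between sensitive groups."""
    return abs(y_pred[s == 0].mean() - y_pred[s == 1].mean())

def equal_opportunity(y_pred, y_true, s):
    """Delta_EO: difference in true positive rates between sensitive groups."""
    pos = y_true == 1  # restrict to nodes whose ground-truth label is positive
    return abs(y_pred[pos & (s == 0)].mean() - y_pred[pos & (s == 1)].mean())

# Toy example: 8 nodes with binary labels, predictions and sensitive attribute.
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])
s      = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(statistical_parity(y_pred, s))           # both groups predict 2/4 positive → 0.0
print(equal_opportunity(y_pred, y_true, s))    # TPR 2/3 vs 1/2
```

A perfectly fair classifier (in the group sense) drives both quantities to zero, which is what the ΔSP and ΔEO columns measure.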
| α | Method | Acc | AUC | ΔSP | ΔEO | ΔSP+ΔEO | Consistency |
|---|---|---|---|---|---|---|---|
| 0.1 | GCN | 65.93 ± 0.33 | 69.86 ± 1.73 | 3.52 ± 2.94 | 2.73 ± 3.18 | 6.26 ± 6.12 | 41.33 ± 0.00 |
| 0.1 | Fair GNN | 68.67 ± 2.52 | 76.95 ± 0.53 | 1.97 ± 0.52 | 2.23 ± 0.31 | 4.20 ± 0.59 | 41.33 ± 0.00 |
| 0.1 | Base AC | 66.16 ± 0.11 | 69.22 ± 0.07 | 0.08 ± 0.01 | 0.40 ± 0.20 | 0.48 ± 0.21 | 41.33 ± 0.00 |
| 0.1 | Fair AC | 66.04 ± 0.69 | 70.72 ± 1.50 | 0.26 ± 0.28 | 0.50 ± 0.62 | 0.76 ± 0.90 | 41.33 ± 0.00 |
| 0.3 | GCN | 65.10 ± 0.24 | 68.42 ± 0.12 | 1.72 ± 1.17 | 1.37 ± 0.51 | 3.08 ± 1.68 | 41.35 ± 0.01 |
| 0.3 | Fair GNN | 68.16 ± 0.59 | 75.67 ± 0.52 | 1.56 ± 0.45 | 3.17 ± 1.07 | 4.73 ± 1.47 | 41.35 ± 0.01 |
| 0.3 | Base AC | 66.26 ± 0.32 | 71.11 ± 1.73 | 0.35 ± 0.15 | 0.73 ± 1.18 | 1.08 ± 1.32 | 41.33 ± 0.00 |
| 0.3 | Fair AC | 65.33 ± 0.30 | 71.20 ± 1.74 | 0.55 ± 0.10 | 0.13 ± 0.15 | 0.68 ± 0.09 | 41.33 ± 0.00 |
| – | GCN | 65.65 ± 0.97 | 69.72 ± 2.15 | 2.75 ± 2.09 | 3.50 ± 3.32 | 6.25 ± 5.40 | 41.38 ± 0.02 |
| – | Fair GNN | 65.97 ± 0.56 | 72.99 ± 0.44 | 2.40 ± 1.44 | 3.07 ± 2.41 | 5.47 ± 3.84 | 41.37 ± 0.01 |
| – | Base AC | 65.45 ± 0.40 | 70.64 ± 1.10 | 0.13 ± 0.20 | 0.33 ± 0.58 | 0.47 ± 0.77 | 41.33 ± 0.00 |
| – | Fair AC | 65.62 ± 0.02 | 71.11 ± 1.02 | 0.05 ± 0.05 | 0.30 ± 0.52 | 0.35 ± 0.50 | 41.33 ± 0.00 |
| – | GCN | 65.37 ± 1.30 | 71.62 ± 2.33 | 5.10 ± 3.12 | 5.53 ± 3.86 | 10.64 ± 6.84 | 41.47 ± 0.04 |
| – | Fair GNN | 63.81 ± 0.50 | 67.57 ± 0.30 | 2.96 ± 0.28 | 1.77 ± 1.33 | 4.73 ± 1.10 | 41.44 ± 0.02 |
| – | Base AC | 66.03 ± 0.69 | 71.06 ± 1.21 | 0.32 ± 0.39 | 0.67 ± 0.65 | 0.99 ± 0.86 | 41.33 ± 0.00 |
| – | Fair AC | 65.38 ± 0.14 | 71.51 ± 0.68 | 0.23 ± 0.37 | 0.03 ± 0.06 | 0.27 ± 0.43 | 41.33 ± 0.00 |

Table 2: Comparison of Fair AC with Fair GNN on the pokec-z dataset with different attribute missing rates (α). The methods are applied on the GCN classifier.

Furthermore, a hyperparameter study of β was done in the original paper; β balances fairness and accuracy (details in Section 3.1). The original study introduced adversarial training to improve group fairness and claimed, based on the β hyperparameter study, that adversarial training is necessary for good group fairness performance. It reported that increasing β improves fairness but slightly reduces accuracy, by about 0.5%. However, the lack of standard deviations in the original figure makes it unclear whether the experiment was run three times or once. In this study, the experiment was run three times.
The mean and standard deviation are shown in Figure 2. In general, the same trends are observed. Notably, at β = 0.8 we observed a marginal increase in fairness, contrary to the continuous decline reported in the original study. From this, it can be concluded that adversarial learning helps the fairness performance of the Fair AC framework.

Figure 2: Hyperparameter value study for β, which influences the trade-off between accuracy and fairness.

4.2 Results beyond original paper

In the original study, Fair AC was presented as a generic method applicable to a wide range of downstream tasks. This claim was primarily supported by tests on different datasets. Therefore, we performed additional experiments to test it. Firstly, our study examined Fair AC's performance across various sensitive attributes. Unlike the original experiments, which only considered region as the sensitive variable, this study also used age and gender. The findings, detailed in Table 3, show that gender as the sensitive attribute produces results comparable to those obtained with region. However, for age, the fairness performance decreases drastically; in this scenario, the standard GCN model outperforms both fair models. This outcome is due to age being an important feature for the final prediction task, which hampers reaching the set accuracy and AUC thresholds. If the thresholds are set lower, the accuracy decreases but the fairness increases. Experiments with different threshold values can be found in Appendix E.
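To make the role of β in these experiments concrete, the overall objective from Section 3.1 can be sketched as follows. This is a minimal sketch with scalar stand-ins for the loss terms (the actual implementation operates on tensors); the function name is our own.

```python
def fairac_loss(l_ae, l_cs, l_c, l_t, beta):
    """Total Fair AC objective: L = L_F + L_C + beta * L_T,
    where the feature-fairness term is L_F = L_ae - beta * L_Cs.

    beta scales the adversarial (sensitive-classifier) terms:
    beta = 0 disables adversarial learning, while larger beta
    trades a little accuracy for better group fairness.
    """
    l_f = l_ae - beta * l_cs  # adversarial term enters with a minus sign
    return l_f + l_c + beta * l_t

# With beta = 0 the objective reduces to reconstruction + completion:
print(round(fairac_loss(l_ae=0.8, l_cs=0.7, l_c=0.3, l_t=0.6, beta=0.0), 6))  # 1.1
```

Setting β = 0 recovers the Base AC variant discussed above, which is what makes the β sweep a direct test of claim 5.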
| Sensitive attribute | Method | Acc | AUC | ΔSP | ΔEO | ΔSP+ΔEO | Consistency |
|---|---|---|---|---|---|---|---|
| region | GCN | 65.10 ± 0.24 | 68.42 ± 0.12 | 1.72 ± 1.17 | 1.37 ± 0.51 | 3.08 ± 1.68 | 41.35 ± 0.01 |
| region | Fair GNN | 68.16 ± 0.59 | 75.67 ± 0.52 | 1.56 ± 0.45 | 3.17 ± 1.07 | 4.73 ± 1.47 | 41.35 ± 0.01 |
| region | Fair AC | 65.33 ± 0.30 | 71.20 ± 1.74 | 0.55 ± 0.10 | 0.13 ± 0.15 | 0.68 ± 0.09 | 41.33 ± 0.00 |
| gender | GCN | 63.40 ± 0.20 | 68.56 ± 0.40 | 3.28 ± 0.71 | 2.97 ± 0.61 | 6.24 ± 1.13 | 41.35 ± 0.01 |
| gender | Fair GNN | 64.25 ± 0.41 | 72.25 ± 2.49 | 2.27 ± 0.95 | 2.63 ± 0.31 | 4.90 ± 0.77 | 41.35 ± 0.01 |
| gender | Fair AC | 66.44 ± 0.47 | 73.39 ± 0.20 | 0.66 ± 0.56 | 0.30 ± 0.10 | 0.96 ± 0.52 | 41.33 ± 0.00 |
| age | GCN | 64.94 ± 1.11 | 71.33 ± 1.94 | 24.69 ± 3.21 | 20.57 ± 3.84 | 45.26 ± 6.96 | 41.35 ± 0.01 |
| age | Fair GNN | 65.79 ± 0.20 | 72.53 ± 1.42 | 39.83 ± 4.80 | 37.23 ± 2.20 | 77.07 ± 6.70 | 41.35 ± 0.01 |
| age | Fair AC | 65.82 ± 0.69 | 74.26 ± 0.42 | 27.46 ± 1.94 | 19.90 ± 2.52 | 47.36 ± 4.38 | 41.33 ± 0.00 |

Table 3: Comparison of Fair AC with Fair GNN and GCN on the pokec-z dataset with different sensitive attributes. The methods are applied on the GCN classifier.

The Fair AC framework's performance was also evaluated on two additional datasets beyond those used in the original study, with results detailed in Table 4. On the recidivism dataset, the fairness results are almost perfect across all methods; while Fair GNN marginally outperforms Fair AC in group fairness, Fair AC performs better in individual fairness. The credit dataset shows a similar pattern, with Fair AC achieving the best fairness performance among the methods tested while maintaining comparable overall performance. These outcomes lend further support to the claim that Fair AC provides better fairness results than comparable methods, and thus to the claim that Fair AC is a generic method. Moreover, we extended our analysis with an additional metric, consistency, as explained in Section 3.4. Contrary to the common trade-off between group and individual fairness [16], our findings (see Table 1) indicate similar consistency levels across GCN, Fair GNN and Fair AC. This suggests that improving group fairness does not necessarily compromise individual fairness.
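The consistency metric behind these individual-fairness numbers (Equation 3) can be sketched as follows. We take a small toy similarity matrix S^F as given (in practice it is derived from node-feature similarity), so the values below are purely illustrative.

```python
import numpy as np

def consistency(y_pred, sim):
    """Individual fairness: 1 minus the average prediction difference
    between pairs of nodes, weighted by the similarity matrix S^F
    (the diagonal i == j is excluded)."""
    n = len(y_pred)
    mask = ~np.eye(n, dtype=bool)                      # exclude i == j
    diff = np.abs(y_pred[:, None] - y_pred[None, :])   # |y_i - y_j| for all pairs
    return 1.0 - (diff * sim)[mask].sum() / sim[mask].sum()

y_pred = np.array([1.0, 1.0, 0.0, 1.0])
sim = np.array([  # symmetric toy similarity matrix S^F
    [1.0, 0.9, 0.1, 0.8],
    [0.9, 1.0, 0.2, 0.7],
    [0.1, 0.2, 1.0, 0.1],
    [0.8, 0.7, 0.1, 1.0],
])
print(consistency(y_pred, sim))  # close to 1: the dissimilar node 2 differs
```

A score of exactly 1 means every pair of similar nodes receives identical predictions, so comparing this number between GCN and Fair AC shows whether the group-fairness gains cost individual fairness.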
| Dataset | Method | Acc | AUC | ΔSP | ΔEO | ΔSP+ΔEO | Consistency |
|---|---|---|---|---|---|---|---|
| credit | GCN | 72.16 ± 4.00 | 64.80 ± 0.45 | 7.43 ± 0.92 | 5.97 ± 0.95 | 13.40 ± 1.46 | 26.46 ± 0.27 |
| credit | Fair GNN | 77.19 ± 0.85 | 64.66 ± 0.46 | 2.02 ± 1.84 | 0.83 ± 1.10 | 2.85 ± 2.91 | 28.02 ± 1.41 |
| credit | Fair AC | 69.78 ± 2.94 | 65.13 ± 0.07 | 0.68 ± 0.51 | 0.50 ± 0.61 | 1.18 ± 0.29 | 27.24 ± 0.81 |
| recidivism | GCN | 62.37 ± 0.02 | 63.01 ± 2.43 | 0.01 ± 0.02 | 0.07 ± 0.12 | 0.08 ± 0.14 | 3.93 ± 0.00 |
| recidivism | Fair GNN | 70.00 ± 0.00 | 56.66 ± 1.46 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 6.60 ± 0.00 |
| recidivism | Fair AC | 63.03 ± 1.17 | 70.32 ± 13.02 | 0.04 ± 0.08 | 0.00 ± 0.00 | 0.04 ± 0.08 | 3.93 ± 0.01 |

Table 4: Comparison of Fair AC with Fair GNN and GCN on the credit and recidivism datasets. The methods are applied on the GCN classifier.

5 Discussion

In this study, various experiments reproducing "Fair Attribute Completion on Graph with Missing Attributes" [1] were presented. The results of these experiments support the original claims of the authors. Specifically, Fair AC has better fairness performance than other graph attribute completion methods while maintaining a comparable accuracy, even when a large part of the attributes is missing. The method uses adversarial learning, which indeed improves performance. To support the claim that Fair AC is a generic method, various additional experiments were performed. From these experiments, it can be concluded that Fair AC is applicable across multiple datasets in different fields. Fair AC also gives accurate results for various sensitive attributes, although the performance may drop if the attribute is important for the downstream task. To examine the impact of Fair AC on individual fairness, an additional metric was implemented, from which it can be concluded that there is almost no trade-off between group fairness and individual fairness in this study.
Since this is contrary to most findings in the literature [16], for future research it would be interesting to look into the causes of a trade-off between individual fairness and group fairness. The code implementation of this study, together with the refactored code from the Fair AC paper, can be found on GitHub⁴.

5.1 What was easy and what was difficult

The original paper provides a complete codebase of Fair AC. This was very helpful, since the codebase also includes the hyperparameters used in the experiments of the original paper, which made running the experiments easier. However, understanding the code turned out to be a non-trivial task. Many training and implementation details were not mentioned in the paper, and the code was not clearly structured or documented. This made it very difficult to adjust the codebase for additional experiments, so we refactored all of the original code. In addition, the GCN and Fair GNN baselines used in the original paper were not part of the codebase, so these were adapted from the Fair GNN codebase and changed to meet the requirements of the Fair AC paper.

5.2 Communication with original authors

We originally reached out to the authors to ask for clarification on the setup of some of their experiments, namely the random seeds used and the exact setup of the Base AC experiment, since the description of this experiment was inconsistent with the codebase. We received a quick and clear response to these questions. Later in the project, we reached out again with questions regarding ambiguities in the data handling and splitting in the code. We received a short response with some clarification.

References

[1] D. Guo, Z. Chu, and S. Li, "Fair attribute completion on graph with missing attributes," in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=9vcXCMp9VEp.
[2] J. Zhou, G. Cui, S.
Hu, et al., "Graph neural networks: A review of methods and applications," AI Open, vol. 1, pp. 57–81, 2020.
[3] D. Jin, Z. Liu, W. Li, D. He, and W. Zhang, "Graph convolutional networks meet Markov random fields: Semi-supervised community detection in attribute networks," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 152–159, Jul. 2019. doi: 10.1609/aaai.v33i01.3301152. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/3780.
[4] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," 2017. arXiv: 1609.02907 [cs.LG].
[5] T. N. Kipf and M. Welling, "Variational graph auto-encoders," 2016. arXiv: 1611.07308 [stat.ML].
[6] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, "An end-to-end deep learning architecture for graph classification," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, Louisiana, USA: AAAI Press, 2018, isbn: 978-1-57735-800-8.
[7] E. Dai and S. Wang, "Say no to the discrimination: Learning fair graph neural networks with limited sensitive attribute information," 2021. arXiv: 2009.01454 [cs.LG].
[8] A. Beutel, J. Chen, Z. Zhao, and E. H. Chi, "Data decisions and theoretical implications when adversarially learning fair representations," 2017. arXiv: 1707.00075 [cs.LG].

⁴ https://github.com/oxkitsune/fact

[9] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, "Fairness through awareness," 2011. arXiv: 1104.3913 [cs.CC].
[10] X. Chen, S. Chen, J. Yao, H. Zheng, Y. Zhang, and I. W. Tsang, "Learning on attribute-missing graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 740–757, Feb. 2022, issn: 1939-3539. doi: 10.1109/tpami.2020.3032189. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2020.3032189.
[11] T. Rahman, B. Surma, M.
Backes, and Y. Zhang, "Fairwalk: Towards fair graph embedding," 2019.
[12] S. Garg, R. Mujumdar, and R. Gajawada, "Exploring fairness in heterogenous graph embeddings," 2021. [Online]. Available: https://rohitmujumdar.github.io/projects/WSTM_Final_Report.pdf.
[13] Z. Jiang, X. Han, C. Fan, et al., "Topology matters in fair graph learning: A theoretical pilot study," 2023. [Online]. Available: https://openreview.net/forum?id=TVMjn0RpLHf.
[14] L. Takac and M. Zábovský, "Data analysis in public social networks," International Scientific Conference and International Workshop Present Day Trends of Innovations, pp. 1–6, Jan. 2012.
[15] C. Agarwal, H. Lakkaraju, and M. Zitnik, "Towards a unified framework for fair and stable graph representation learning," in Uncertainty in Artificial Intelligence, PMLR, 2021, pp. 2114–2124.
[16] R. Binns, "On the apparent conflict between individual and group fairness," in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 514–524.
[17] M. Hardt, E. Price, and N. Srebro, "Equality of opportunity in supervised learning," 2016. arXiv: 1610.02413 [cs.LG].
[18] P. Xu, Y. Zhou, B. An, W. Ai, and F. Huang, "GFairHint: Improving individual fairness for graph neural networks via fairness hint," 2023. arXiv: 2305.15622 [cs.LG].
[19] Y. Dong, J. Ma, S. Wang, C. Chen, and J. Li, "Fairness in graph mining: A survey," IEEE Transactions on Knowledge and Data Engineering, 2023.
[20] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," CoRR, vol. abs/1403.6652, 2014. arXiv: 1403.6652. [Online]. Available: http://arxiv.org/abs/1403.6652.

A Model details

In this section, the components of the loss function are described in more detail.
As explained in Section 3, the loss function used by Fair AC consists of three components: a feature fairness loss ($L_F$), a completion loss ($L_C$) and a topological fairness loss ($L_T$), with a parameter $\beta$ to control the influence of the sensitive classifier:

$$L = L_F + L_C + \beta L_T \quad (4)$$

Feature fairness. The feature fairness loss consists of two components: an auto-encoder loss and a sensitive classifier loss. It leverages the sensitive classifier $C_s$ to adversarially train the auto-encoder, such that the encoder generates fair feature embeddings that can fool the sensitive classifier:

$$L_F = L_{ae} - \beta L_{C_s} \quad (5)$$

Auto-encoder. The auto-encoder encodes the original attributes $X_i$ into feature embeddings $H_i$ and reconstructs the attributes $\hat{X}_i$ from the latent space. The reconstructed attributes should be close to the original attributes, which this loss optimizes for:

$$L_{ae} = \frac{1}{|V_{keep}|} \sum_{i \in V_{keep}} \lVert \hat{X}_i - X_i \rVert^2 \quad (6)$$

Sensitive classifier. The sensitive classifier takes feature embeddings $H_i$ as input and predicts the sensitive attribute $\hat{s}_i$. Since the sensitive attribute is binary, a binary cross-entropy loss is used:

$$L_{C_s} = -\frac{1}{|V_{keep}|} \sum_{i \in V_{keep}} \left[ s_i \log \hat{s}_i + (1 - s_i) \log(1 - \hat{s}_i) \right] \quad (7)$$

Attribute completion. The attribute completion loss measures the difference between the true feature embeddings $H_i$ and the predicted feature embeddings $\hat{H}_i$ for nodes whose attributes are intentionally dropped ($V_{drop}$), averaged over all such nodes:

$$L_C = \frac{1}{|V_{drop}|} \sum_{i \in V_{drop}} (\hat{H}_i - H_i)^2 \quad (8)$$

Topological fairness. Attribute completion may introduce topological unfairness, because it assumes topological information is similar to attributes. To mitigate this, Fair AC again leverages the sensitive classifier $C_s$: the loss encourages completed feature embeddings that fool $C_s$ into predicting uniformly over the sensitive category:

$$L_T = \frac{1}{|V_{drop}|} \sum_{i \in V_{drop}} \left[ s_i \log \hat{s}_i + (1 - s_i) \log(1 - \hat{s}_i) \right] \quad (9)$$

Note that $L_T$ is the negative of the cross-entropy in Eq. (7), so minimizing it maximizes the error of $C_s$ on the completed embeddings.
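Assuming the relevant embeddings and classifier outputs are available as tensors, the losses above can be sketched in PyTorch as follows (function names are ours for illustration, not the Fair AC API):

```python
import torch
import torch.nn.functional as F

def autoencoder_loss(x_rec, x):
    # L_ae (Eq. 6): mean squared reconstruction error over V_keep
    return F.mse_loss(x_rec, x)

def sensitive_classifier_loss(s_prob, s):
    # L_Cs (Eq. 7): standard binary cross-entropy for the classifier C_s
    return F.binary_cross_entropy(s_prob, s)

def feature_fairness_loss(x_rec, x, s_prob, s, beta):
    # L_F = L_ae - beta * L_Cs (Eq. 5): the auto-encoder is rewarded for
    # embeddings that increase the sensitive classifier's error
    return autoencoder_loss(x_rec, x) - beta * sensitive_classifier_loss(s_prob, s)

def completion_loss(h_pred, h_true):
    # L_C (Eq. 8): mean squared error between predicted and true embeddings over V_drop
    return F.mse_loss(h_pred, h_true)

def topological_fairness_loss(s_prob_drop, s_drop):
    # L_T (Eq. 9): the negative of the cross-entropy, so minimizing it
    # pushes C_s toward uncertain predictions on completed embeddings
    return -F.binary_cross_entropy(s_prob_drop, s_drop)

def total_loss(x_rec, x, s_prob, s, h_pred, h_true, s_prob_drop, s_drop, beta):
    # L = L_F + L_C + beta * L_T (Eq. 4)
    return (feature_fairness_loss(x_rec, x, s_prob, s, beta)
            + completion_loss(h_pred, h_true)
            + beta * topological_fairness_loss(s_prob_drop, s_drop))
```

Here `s_prob` denotes the classifier's predicted probability $\hat{s}_i$ and `s` the ground-truth sensitive attribute, both restricted to the relevant node subset.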
| Dataset | NBA | Pokec-z | Pokec-n | Credit | Recidivism |
|---|---|---|---|---|---|
| # of nodes | 403 | 67,797 | 66,569 | 30,000 | 18,876 |
| # of edges | 16,570 | 882,765 | 729,129 | 200,526 | 321,308 |
| Density | 0.20456 | 0.00038 | 0.00032 | 0.00045 | 0.001804 |
| Sensitive attribute | country | region | region | age | race |
| Predicted label | salary | working field | working field | no default next month | bail vs. no bail |

Table 5: Statistics of the five graph datasets: NBA, Pokec-z, Pokec-n, Credit and Recidivism.

In Table 5, the statistics for all five datasets used in this study are given. The datasets differ considerably in size: Pokec-z and Pokec-n are the largest datasets, followed by Credit, Recidivism and NBA, respectively. NBA has only a small number of nodes, but its density is much higher than that of all other datasets. Lastly, the default sensitive attributes are displayed, together with the node labels that are predicted in the downstream node classification task.

C Model training details

Algorithm 1 Model training
Input: G = (V, E, X), S
Output: auto-encoder f_AE, sensitive classifier C_s, attribute completion f_AC
1: Obtain topological embedding T with DeepWalk
2: repeat                                  ▷ Pre-train f_AE
3:   Obtain the feature embeddings H with f_AE
4:   Optimize f_AE to prevent unstable embeddings by L_ae + L_C
5: until 200 iterations
6: repeat                                  ▷ Train f_AE, C_s and f_AC using downstream task
7:   Obtain the feature embeddings H with f_AE
8:   Optimize C_s by L_Cs
9:   Optimize f_AE to mitigate feature unfairness by loss L_F
10:  Divide V+ into V_keep and V_drop based on α
11:  Obtain the feature embeddings of nodes with missing attributes V_drop by f_AC
12:  Optimize f_AC to achieve attribute completion by loss L_C
13:  Optimize f_AC to mitigate topological unfairness by loss L_T
14: until convergence

In Algorithm 1, the model training process is shown. First of all, the auto-encoder is pre-trained for 200 epochs.
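As a rough illustration of the two training stages in Algorithm 1, the sketch below uses small linear stand-ins for f_AE, C_s and f_AC. All module names, dimensions and iteration counts are illustrative and do not reflect the actual implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in modules (illustrative only, not Fair AC's architecture)
enc, dec = torch.nn.Linear(8, 4), torch.nn.Linear(4, 8)  # auto-encoder f_AE
c_s = torch.nn.Linear(4, 1)                              # sensitive classifier C_s
f_ac = torch.nn.Linear(6, 4)                             # attribute completion f_AC

x = torch.randn(32, 8)                       # node attributes X
s = torch.randint(0, 2, (32, 1)).float()     # sensitive attribute S
t = torch.randn(32, 6)                       # step 1: topological embeddings T (DeepWalk)
beta, alpha = 1.0, 0.3

opt_ae = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
opt_cs = torch.optim.Adam(c_s.parameters(), lr=1e-3)
opt_ac = torch.optim.Adam(f_ac.parameters(), lr=1e-3)

# Steps 2-5: pre-train f_AE for 200 iterations on the reconstruction loss
for _ in range(200):
    opt_ae.zero_grad()
    F.mse_loss(dec(enc(x)), x).backward()
    opt_ae.step()

# Steps 6-14: alternating updates ("until convergence"; a few iterations here)
for _ in range(20):
    # Step 8: optimize C_s by L_Cs
    opt_cs.zero_grad()
    F.binary_cross_entropy_with_logits(c_s(enc(x).detach()), s).backward()
    opt_cs.step()

    # Step 9: optimize f_AE adversarially by L_F = L_ae - beta * L_Cs
    opt_ae.zero_grad()
    h = enc(x)
    l_f = F.mse_loss(dec(h), x) - beta * F.binary_cross_entropy_with_logits(c_s(h), s)
    l_f.backward()
    opt_ae.step()

    # Step 10: split nodes into V_keep / V_drop based on alpha
    n_drop = int(alpha * x.shape[0])
    drop = torch.randperm(x.shape[0])[:n_drop]

    # Steps 11-13: complete embeddings for V_drop, optimize f_AC by L_C + beta * L_T
    opt_ac.zero_grad()
    h_hat = f_ac(t[drop])
    l_c = F.mse_loss(h_hat, enc(x[drop]).detach())
    l_t = -F.binary_cross_entropy_with_logits(c_s(h_hat), s[drop])  # fool C_s
    (l_c + beta * l_t).backward()
    opt_ac.step()
```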
After this, the auto-encoder is optimized together with the sensitive classifier and the attribute completion mechanism. More details can be found in Section 3.5.1.

The source is published at https://github.com/oxkitsune/fact. A complete refactor of the original codebase (https://github.com/donglgcn/FairAC) has been completed in order to make the framework significantly easier to apply to various downstream tasks. The published code can be used as a library, with any GNN. The GNN only requires a slight modification, as the final layer needs to be excluded in order to train the Fair AC model. For more details, see the README.md file in the repository.

E Fairness performance on age sensitive attribute with different thresholds

| Threshold | Method | Acc | AUC | SP | EO | SP+EO | Consistency |
|---|---|---|---|---|---|---|---|
| 65% acc / 69% auc | GCN | 64.94 ± 1.11 | 71.33 ± 1.94 | 24.69 ± 3.21 | 20.57 ± 3.84 | 45.26 ± 6.96 | 41.35 ± 0.01 |
| | Fair GNN | 65.79 ± 0.20 | 72.53 ± 1.42 | 39.83 ± 4.80 | 37.23 ± 2.20 | 77.07 ± 6.70 | 41.35 ± 0.01 |
| | Fair AC | 65.82 ± 0.69 | 74.26 ± 0.42 | 27.46 ± 1.94 | 19.90 ± 2.52 | 47.36 ± 4.38 | 41.33 ± 0.00 |
| 60% acc / 64% auc | GCN | 61.85 ± 2.57 | 70.52 ± 0.80 | 18.55 ± 1.91 | 14.57 ± 1.25 | 33.12 ± 2.29 | 41.35 ± 0.01 |
| | Fair GNN | 59.77 ± 0.29 | 69.36 ± 3.51 | 19.66 ± 8.48 | 17.37 ± 5.34 | 37.03 ± 13.82 | 41.35 ± 0.01 |
| | Fair AC | 61.27 ± 0.84 | 73.59 ± 0.85 | 17.14 ± 2.77 | 12.07 ± 2.44 | 29.21 ± 5.13 | 41.33 ± 0.00 |
| 50% acc / 50% auc | GCN | 53.60 ± 0.06 | 55.34 ± 1.86 | 0.06 ± 0.06 | 0.03 ± 0.06 | 0.09 ± 0.08 | 41.35 ± 0.01 |
| | Fair GNN | 53.59 ± 0.00 | 54.29 ± 2.23 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 41.35 ± 0.01 |
| | Fair AC | 53.59 ± 0.00 | 54.32 ± 1.96 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 41.33 ± 0.00 |

Table 6: Comparison of Fair AC with Fair GNN and GCN on the pokec-z dataset with age as the sensitive attribute, for different accuracy and AUC thresholds. The methods are applied on the GCN classifier. The best results are denoted in bold.

In Table 6, the experiments with different thresholds for the sensitive attribute age are displayed.
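Table 6 reports statistical parity (SP) and equal opportunity (EO) differences. For reference, these group fairness metrics can be computed as in the sketch below (function names are ours; values are in percentage points, as in the tables):

```python
import numpy as np

def statistical_parity_diff(y_pred, s):
    # Delta_SP = |P(y_hat = 1 | s = 0) - P(y_hat = 1 | s = 1)|, in percent
    return abs(y_pred[s == 0].mean() - y_pred[s == 1].mean()) * 100

def equal_opportunity_diff(y_pred, y_true, s):
    # Delta_EO = |P(y_hat = 1 | y = 1, s = 0) - P(y_hat = 1 | y = 1, s = 1)|, in percent
    pos = y_true == 1
    return abs(y_pred[pos & (s == 0)].mean() - y_pred[pos & (s == 1)].mean()) * 100
```

Here `y_pred` holds binary predictions, `y_true` the ground-truth labels and `s` the binary sensitive attribute; a perfectly group-fair classifier yields 0.0 for both metrics, as in the bottom rows of Table 6.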
In the top row, the results for the original thresholds are shown. As one can see, and as was analysed in Section 4.2, the model embeddings are considerably more unfair for age than for the other sensitive attributes (an SP+EO of 47.36 versus 0.96, respectively). As the thresholds decrease, the embeddings become fairer, until the point where SP+EO reaches 0.0. This means that the model is able to produce fair embeddings, but for sensitive attributes that are informative for the downstream task, this can come at a considerable cost in performance.

F GPU usage

| Experiments | #Fair AC models | #Fair GNN models | #GNN models | #datasets | #seeds | GPU usage (hours) |
|---|---|---|---|---|---|---|
| Main (Table 1) | 1 | 1 | 1 | 3 | 3 | 7.5 |
| Alpha (Table 2) | 8 | 4 | 4 | 1 | 3 | 16 |
| Beta (Figure 2) | 5 | 0 | 0 | 1 | 3 | 7.5 |
| Sensitive attributes (Table 3, Table 6) | 4 | 4 | 4 | 1 | 3 | 10 |
| Other datasets (Table 4) | 1 | 1 | 1 | 2 | 3 | 5 |

Table 7: GPU usage for all experiments published in this report. Every Fair AC model takes roughly 30 minutes to train; every GNN and Fair GNN model takes about 10 minutes.

In Table 7, the GPU hours used per experiment are displayed. The alpha experiments cost the most GPU hours, since all models, including Fair AC without adversarial training, had to be trained for every value of alpha.

G Hyperparameters

For training the models, various hyperparameters were set; all of them were adopted from the original study. By default, all models were trained for 3000 epochs, which includes 200 epochs of pre-training of the auto-encoder. An initial learning rate of 0.001 was used, with a weight decay of 1 × 10⁻⁵ applied to it. During training, a dropout rate of 0.5 is used. The main auto-encoder has a hidden dimension of 128 and uses one attention head. The default feature drop rate (α) was set to 0.3, unless mentioned otherwise. β was set to 1.0 for all datasets, except for the pokec-n dataset, where it was set to 0.5. As accuracy threshold, 65.0 was used, and as AUC threshold, 69.0.
The hyperparameters used for all datasets are shown in Table 8. For the datasets that were not used in the original study, the hyperparameters were adapted from Dong, Ma, Wang, et al. [19].

| Dataset | Min. # datapoints for AC training | Min. # datapoints for sensitive classifier training |
|---|---|---|
| Pokec-n | 500 | 200 |
| Pokec-z | 500 | 200 |
| Credit | 6000 | 500 |
| Recidivism | 200 | 100 |

Table 8: Hyperparameters specific to the data loading for all datasets used in this study.

In addition to these hyperparameters, the GNN and Fair GNN models had to be pre-trained, which was done for 800 epochs. For the pre-training, all other hyperparameters are equal to the normal training hyperparameters. Fair AC uses topological input embeddings created with DeepWalk [20]. These embeddings were trained for 10 epochs, with a walk length of 100, a window size of 5, 4 workers and an embedding dimension of 64. Lastly, a learning rate of 0.05 was used.
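For convenience, the hyperparameters listed above can be collected into a single configuration, for example as a Python dict (the key names are ours, not those of the refactored codebase):

```python
# Hyperparameters as described in Appendix G; key names are illustrative.
CONFIG = {
    "epochs": 3000,
    "pretrain_epochs": 200,     # auto-encoder pre-training, included in `epochs`
    "gnn_pretrain_epochs": 800, # GNN / Fair GNN pre-training
    "lr": 0.001,
    "weight_decay": 1e-5,
    "dropout": 0.5,
    "hidden_dim": 128,
    "attention_heads": 1,
    "alpha": 0.3,               # default feature drop rate
    "beta": {"default": 1.0, "pokec-n": 0.5},
    "acc_threshold": 65.0,
    "auc_threshold": 69.0,
    "deepwalk": {               # topological embedding hyperparameters
        "epochs": 10,
        "walk_length": 100,
        "window_size": 5,
        "workers": 4,
        "dim": 64,
        "lr": 0.05,
    },
}
```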